Abstract



We consider the design of efficient algorithms for a multicore computing environment with
a global shared memory and p cores, each having a cache of size M, and with data organized
in blocks of size B. We characterize the class of 'Hierarchical Balanced Parallel (HBP)'
multithreaded computations for multicores. HBP computations are similar to the hierarchical divide
& conquer algorithms considered in recent work, but have some additional features that guarantee
good performance even when accounting for the cache misses due to false sharing. Most of
our HBP algorithms are derived from known cache-oblivious algorithms with high parallelism;
however, we incorporate new techniques that reduce the effect of false sharing.

Our approach to addressing false sharing costs (or more generally, block misses) is to ensure
that any task that can be stolen shares $O(1)$ blocks with other tasks. We use a gapping technique
for computations that have larger than $O(1)$ block sharing. We also incorporate the property
of limited access writes analyzed in [13], and we bound the cost of accessing shared blocks on
the execution stacks of tasks.

We present the Priority Work Stealing (PWS) scheduler, and we establish that, given a
sufficiently 'tall' cache, PWS deterministically schedules several highly parallel HBP algorithms,
including those for scans, matrix computations and FFT, with cache misses bounded by the
sequential complexity, when accounting for both traditional cache misses and for false sharing.
We also present a list ranking algorithm with almost optimal bounds. PWS schedules without
using cache or block size information, and uses knowledge of processors only to the extent of
determining the available locations from which tasks may be stolen; thus it schedules
resource-obliviously.



1 Introduction 

We consider the efficient scheduling of multithreaded algorithms [TJj in a multicore computing
environment. We model a multicore as consisting of p cores (or processors) with an arbitrarily 
large shared memory, where each core has a private cache of size M. Data is organized in blocks of 
size B, and the initial input of size n is in the main memory, in n/B blocks. Recently, there has been 
considerable work on developing efficient algorithms for multicores [iHl El EJ [101 ttH El EOl HI llS] ; 
many of these algorithms are multithreaded. An efficient multicore algorithm attempts to obtain 
both work-efficient speed-up as well as cache-efficiency. However, none of these prior results have 
addressed false sharing costs when considering cache-efficiency. 
Cache Misses, False Sharing, and Block Misses. When a core C needs a data item x that
is not in its private cache, it reads in the block $\beta$ that contains x from main memory at the cost
of one cache miss. This new block replaces an existing block in the private cache, which is evicted
using an optimal cache replacement policy (LRU suffices for our algorithms). If another core C'
modifies an entry in block $\beta$, then with cache coherence, $\beta$ is invalidated in C's cache, and the next
time core C needs to access data in $\beta$, an updated copy of $\beta$ is brought into C's cache. In the
absence of cache coherence, some method of assigning ownership to a block is needed so that the
updates to items in a block are correctly performed. For concreteness we will assume the above
cache coherence protocol.

The delay caused by different cores writing into the same block can be quite significant, and
this is a caching delay that is present only in the parallel context. In particular, consider a parallel
execution in which two or more cores between them perform multiple accesses to a block $\beta$, which
include $x \ge 1$ writes. These accesses could cause $O(b \cdot x)$ delay at every core accessing $\beta$, where
b is the delay due to a single cache miss. These costs might arise if two cores are sharing a block
(which occurs, for example, if data partitioning does not match block boundaries) or if many cores
access a single block (which could occur if the cores are all executing very small tasks). Further,
x can be arbitrarily large unless care is taken in the algorithm design. We refer to any access of a
block that is not in cache due to the block being shared by multiple cores as a block miss.

False sharing is a common example of a block being shared across cores with the result that
block misses occur. This term is usually applied to the case when two cores write into different
segments of an array where the two segments share a block. In this case, as mentioned above, each
of the two cores may incur $\Omega(B)$ cache misses when writing their portion of this block due to the
block 'ping-ponging' between the two cores. We use the term block miss in this paper as a more
general term to include all types of delays due to block sharing.

Schedulers and Resource Obliviousness. We consider multithreaded algorithms that expose 
parallelism by forking (or spawning) tasks, but make no mention of the processor (or core) that 
needs to execute any given task, and do not tailor the size of the task being executed to the 
cache size or block size. Such an algorithm is called a multicore-oblivious algorithm in [11]. The
multicore algorithms in [9l O [ini [H] are all multicore-oblivious; however, these papers design 
run-time schedulers which use their knowledge of cache and task sizes in order to obtain efficient 
multicore performance. 

In this paper, we use the term resource-obliviousness to refer to an execution of a
multicore-oblivious algorithm by a scheduler that does not use knowledge of cache sizes or other
parameters of the multicore in its execution. Earlier examples of resource-oblivious algorithms are
in [18, 6], which use RWS (the randomized work-stealing scheduler), but these do not address the cost
of false sharing. However, it appears that if we want resource-oblivious execution, we must consider
the effect of block misses, since we cannot avoid steals of tasks of size smaller than B unless the
scheduler knows the block size. But if such small tasks are stolen when writes occur within such
tasks, block misses can be expected to occur. The bounds obtained in [18, 6] for RWS are weaker
than those we obtain for our scheduler, PWS, even if we ignore block misses. In a companion paper
[13] we analyze the performance of RWS considering both cache and block misses.

Our Contributions. The contributions of this paper are two-fold. 

1. First, we set up a framework to analyze - and optimize for - the caching overhead of both 
cache misses and block misses while also allowing for high parallelism. 

We identify a basic primitive, the balanced parallel (BP) computation. The BP computation is
the basic building block in Hierarchical Balanced Parallel (HBP) computations, which are obtained
through sequencing and parallel recursion. HBP computations have the limited access property for
writes, which requires that any variable is written at most a constant number of times. We present
techniques for reducing the cost of block misses, such as $O(1)$-block sharing, and gapping. We also
use a result presented in our companion paper [13] which bounds the number of accesses to any given
block in a class of algorithms with limited access writes that includes HBP computations.

HBP computations are similar to the Hierarchical Divide and Conquer (HD&C) class in [5]. 
But they have important differences, notably the limited access requirement. We analyze HBP 
algorithms for scans, matrix transposition (MT), Strassen's matrix multiplication, converting a 
matrix between row major (RM) and bit interleaved (BI) layouts, FFT, list ranking and graph 
connected components. Most of these are known algorithms, but some are new and others are 
modified to conform to HBP and to achieve low block miss cost. 

On the algorithm analysis side, our analyses are in terms of certain structural parameters of
the algorithms. On an input of length n, these include the work $W(n)$, the depth or critical path
length $T_\infty$, the cache complexity in a sequential execution $Q(n, M, B)$, and new parameters: $f(r)$,
the cache-friendliness, and $L(r)$, a block sharing measure. It suffices to determine these parameters
in order to analyze the algorithms. Further, the algorithm design problem can focus on minimizing
these parameters.^1

2. Our second contribution is a new deterministic work-stealing scheduler, the Priority
Work-Stealing Scheduler (PWS), which is tailored to perform well on HBP computations. It achieves a
lower caching overhead due to steals than the bounds derived for RWS in [18, 6] for the case when
block misses are not considered, and in our companion paper [13] when both cache and block misses
are considered. For most of the algorithms we consider, we obtain optimal cache miss overhead
with a sufficiently tall cache, considering both cache and block misses, when the input is larger
than the combined sizes of the caches. PWS is also a deterministic scheduler, for which we give a
reasonably simple distributed implementation.

2 Computation Model 

We consider a class of multithreaded parallel computations that expose their parallelism through 
binary forking of parallel tasks (see, e.g., [14], Chapter 27). Parallel tasks are scheduled on cores 
using a work stealing scheduler. 

The basic unit in our formulation is a computation tree T with binary forking of tasks, which 
forms the downpass of the computation. The downpass is followed by an up-pass on a reverse 
tree where two forked tasks join, and once the execution of the two forked tasks is completed, the 
computation is continued by the task that forked them. A simple example, M-Sum, is shown below, 
which computes the sum of the n elements in array A. 

M-Sum(A[1..n], s)   % Returns s = \sum_{i=1}^{n} A[i]

if n = 1 then return s := A[1] fi
fork(M-Sum(A[1..n/2], s1); M-Sum(A[n/2 + 1..n], s2))
return s = s1 + s2
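As an illustration only, the following Python sketch mirrors M-Sum with 0-based, half-open index ranges. The name `m_sum` and its parameters are assumptions of this sketch, not part of the paper. The two recursive calls stand for the two forked tasks; here they simply run one after the other, which corresponds to a sequential execution that ignores the fork.

```python
def m_sum(a, lo=0, hi=None):
    """Sum a[lo:hi] by binary splitting, mirroring M-Sum's fork tree.

    Assumes a non-empty range; the forks are executed sequentially.
    """
    if hi is None:
        hi = len(a)
    if hi - lo == 1:              # leaf task: a single element
        return a[lo]
    mid = lo + (hi - lo) // 2
    s1 = m_sum(a, lo, mid)        # first forked task (continued by the forking core)
    s2 = m_sum(a, mid, hi)        # second forked task (the one placed on the task queue)
    return s1 + s2                # up-pass: join the two partial sums
```

In a work-stealing execution, the second call is the task that another core could steal.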

Initially the root task for M-Sum's computation is given to a single core. This root task
corresponds to the entire computation of M-Sum on array A[1..n]. In a sequential execution, this
computation proceeds by ignoring fork commands. In a work stealing multicore execution, subtasks
are acquired by other cores via task stealing. To this end, each core C has a task queue. It adds
forked tasks to the bottom of the queue, while tasks are stolen from the top of the queue. So in
particular, when C, on executing a task $\tau$, generates forked tasks $\tau_1$ and $\tau_2$, it places $\tau_2$ on its
queue, and continues with the execution of $\tau_1$. When C completes $\tau_1$, if $\tau_2$ is still on its queue,
it resumes the execution of $\tau_2$, and otherwise there is nothing on its queue so it seeks to steal a
new task. The core executing the last of $\tau_1$ and $\tau_2$ to finish will complete the execution of $\tau$ in the
up-pass. While C is executing task $\tau$, other cores will be acquiring work by stealing tasks from C's
task queue, and then in turn will be generating their own task queues from which further subtasks
can be stolen. We will refer to the portion of task $\tau$ executed by C as the kernel of task $\tau$.

^1 We note that the recent sorting algorithm in [12] uses the scheduling bounds developed here in its analysis.

We have outlined the mechanism of work-stealing for a simple tree-structured computation.
However, work-stealing applies to general DAG-structured computations, and randomized work
stealing has been analyzed for DAGs represented by series-parallel graphs and more general
structures (e.g., 13 H]). In this paper, we consider algorithms whose computation DAG represents a
Hierarchical Balanced Parallel (HBP) computation, which is defined in Section 3.1, and we
establish that they perform very well when executed under the Priority Work Stealing (PWS) scheduler,
which we introduce in Section 4.

2.1 Cache Misses 

Work stealing causes execution of a multithreaded algorithm to incur additional cache misses over
those incurred in a sequential execution, and it also introduces block misses. We introduce two
parameters to express these costs, the cache-friendliness function $f(r)$ and the block-sharing function
$L(r)$. We introduce $f(r)$ here. We define $L(r)$ and discuss block misses further in Section 2.2.

Definition 2.1. A collection of r words of data is f-cache friendly if they are contained in
$O(r/B + f(r))$ blocks. An HBP computation is f-cache friendly if for every task $\tau$ in the computation,
the sequence of words accessed by $\tau$ is $f(|\tau|)$-cache friendly.

For instance, $f(r) = 1$ if $\tau$ accesses an array stored in contiguous locations; if $\tau$ accesses a
$\sqrt{r} \times \sqrt{r}$ submatrix of a matrix stored in RM (i.e., row major), then $f(r) = \sqrt{r}$.
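The two cases can be checked by counting the distinct blocks that a set of word addresses touches. The sketch below is illustrative only: the helper names (`blocks_touched`, `contiguous`, `rm_submatrix`) and the concrete sizes in the usage note are assumptions, not taken from the paper.

```python
def blocks_touched(addresses, B):
    """Number of distinct size-B blocks that contain the given word addresses."""
    return len({addr // B for addr in addresses})


def contiguous(start, r):
    """Addresses of r words stored in consecutive locations."""
    return range(start, start + r)


def rm_submatrix(n, row0, col0, side):
    """Addresses of a side x side submatrix of an n x n row-major matrix."""
    return [(row0 + i) * n + (col0 + j)
            for i in range(side) for j in range(side)]
```

With B = 8 and r = 64, `blocks_touched(contiguous(5, 64), 8)` is 9, within r/B + 1, while `blocks_touched(rm_submatrix(64, 3, 5, 8), 8)` is 16 = r/B + sqrt(r), since each of the 8 short rows straddles a block boundary.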

Let $\tau$ be a task in a multithreaded algorithm A. We will use the size of $\tau$ to denote the number
of words accessed by $\tau$. If $\tau$ is stolen by a core C, then C incurs additional cache misses over the
sequential execution, since it will have to read the possibly previously read data needed to execute
$\tau$. Define $Q_\tau$ to be the number of cache misses incurred by task $\tau$ in the sequential execution of A.
The excess cache miss caused by the steal of $\tau$ is defined to be the number of cache misses incurred
by C in its execution of $\tau$ minus $c \cdot Q_\tau$, for a suitable constant $c \ge 1$.

A stolen task of size M could have incurred no cache misses in an execution in which it was not 
stolen, but once the size of the stolen task reaches 2M, its execution when not stolen would incur 
at least M/B cache misses. The following lemma makes this precise. 

Lemma 2.1. A stolen task $\tau$ incurs at most $O(\min\{M/B, |\tau|/B\} + f(|\tau|))$ additional cache misses
compared to the steal-free sequential computation. If $f(|\tau|) = O(|\tau|/B)$ and $|\tau| \ge 2M$, this is an
excess of 0 cache misses.

Proof. In the sequential execution of the algorithm, the execution of $\tau$ incurs at least
$Q_\tau = \max\{0, |\tau|/B - M/B\}$ cache misses since $|\tau|$ data has to be accessed, of which at most M is in
cache. C's execution of $\tau$ incurs $O(|\tau|/B + f(|\tau|)) = O(Q_\tau + \min\{M/B, |\tau|/B\} + f(|\tau|))$ cache misses.
For $|\tau| \ge 2M$, $Q_\tau \ge |\tau|/(2B)$, and if $f(|\tau|) = O(|\tau|/B)$, then
$\min\{M/B, |\tau|/B\} + f(|\tau|) = O(|\tau|/B) = O(Q_\tau)$.
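The last step of the proof can be spelled out as follows. This is a sketch of the arithmetic only, using the symbols of Lemma 2.1 ($Q_\tau$ for the sequential cache misses of task $\tau$, $|\tau|$ for its size, and $f$ for the cache-friendliness function); the constants are not optimized.

```latex
% Case |\tau| \ge 2M, i.e. M \le |\tau|/2:
\[
  Q_\tau \;\ge\; \frac{|\tau| - M}{B} \;\ge\; \frac{|\tau|}{2B},
\]
% so, when f(|\tau|) = O(|\tau|/B),
\[
  \min\Bigl\{\frac{M}{B}, \frac{|\tau|}{B}\Bigr\} + f(|\tau|)
  \;\le\; \frac{|\tau|}{B} + O\Bigl(\frac{|\tau|}{B}\Bigr)
  \;=\; O(Q_\tau).
\]
```

The extra misses are therefore absorbed by the term $c \cdot Q_\tau$ in the definition of the excess, for a suitable constant c.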

2.2 Block Misses 

We discuss here our basic set-up for coping with block misses. As mentioned earlier, we assume
that block misses are handled under a cache coherence protocol whereby a write into a location in
a shared block $\beta$ by core C invalidates the copy of $\beta$ in every other cache that holds $\beta$ at the time
of the write. This is done so that data consistency is maintained within the elements of a block
across all copies in caches at all times. There are other ways of dealing with block misses (see,
e.g., [19]), but we believe that the block miss cost with our invalidation rule is likely as high as (or
higher than) that incurred by other mechanisms. Thus, our upper bounds should hold for most of
the coping mechanisms known for handling block misses.

A block miss occurs at a core C when it has a block $\beta$ which it shares with one or more
other cores, and it needs to read $\beta$ again because another core wrote into a location in $\beta$, thereby
invalidating the copy of $\beta$ in C's cache. The cost of such a block miss is at least that of one cache
miss, but it could be much larger, depending on the number of cores that share $\beta$ and write into it;
in fact, the cost of a block miss could be unbounded in a scenario where several cores repeatedly
write into locations in the block, if the system mechanism for transferring access to the block
does not ensure fairness. Our analysis for bounding the cost of block misses does not make any
assumptions about the mechanism used for transferring accesses to a shared block under writes.
Hence the bounds we obtain are truly worst-case.

Definition 2.2. Suppose that block $\beta$ is moved m times from one cache to another (due to cache or
block misses) during a time interval $T = [t_1, t_2]$. Then m is defined to be the block delay incurred
by $\beta$ during T.

The block wait cost incurred by a task $\tau$ on a block $\beta$ is the delay incurred during the execution
of $\tau$ due to block misses when accessing $\beta$, measured in units of cache misses.

Note that the block wait cost incurred by a task $\tau$ on a block $\beta$ is the delay incurred as measured
in units of cache misses. Clearly, the block delay of a block $\beta$ during a time interval T is an upper
bound on the block wait cost incurred by any task on block $\beta$ during T.

We now define L-block sharing. 

Definition 2.3. A task $\tau$ of size r is L-block sharing if there are $O(L(r))$ blocks which $\tau$ can
share with all other tasks that could be scheduled in parallel with $\tau$ and could access a location in
the block (these other tasks do not include subtasks of $\tau$). A computation has block sharing function
L if every task in it is L-block sharing.
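As a simplified illustration of this definition, suppose parallel tasks write to disjoint contiguous segments of one array. The sketch below counts, for each segment, the blocks it has in common with any other segment. The function name `shared_block_counts` and the restriction to contiguous segments are assumptions made for the example.

```python
def shared_block_counts(segments, B):
    """For each half-open segment (lo, hi) of word addresses, count the
    size-B blocks that it has in common with at least one other segment."""
    block_sets = [set(range(lo // B, (hi - 1) // B + 1)) for lo, hi in segments]
    counts = []
    for i, mine in enumerate(block_sets):
        others = set().union(*(s for j, s in enumerate(block_sets) if j != i))
        counts.append(len(mine & others))
    return counts
```

For segments `[(0, 10), (10, 20), (20, 30)]` and B = 8 the counts are `[1, 2, 1]`: a contiguous segment shares at most its first and last block with its neighbours, which is the $O(1)$-block sharing situation. Segments aligned to block boundaries share no blocks at all.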

The following definition is from [13].

Definition 2.4. [13] An algorithm is limited-access if each of its writable variables is accessed
$O(1)$ times.

The two main algorithmic techniques that we use to reduce the cost of block misses are to
enforce $O(1)$-block sharing and limited access. In some of the algorithms, we also use a gapping
technique to reduce the block miss cost. We will also assume the following system property.
Whenever a core requests space, it is allocated in block-sized units; naturally, the allocations to
different cores are disjoint and entail no block sharing.
3 HBP Computations and Algorithms



Balanced parallel (BP) computations, defined below, form the backbone of our HBP algorithms.
Recall that the size $|\tau|$ of a task $\tau$ is the amount of data accessed by $\tau$. Note that the size of a
task is a positive integer.

It will also be helpful to specify the notions of local and global variables. 

Definition 3.1. A variable x declared in a procedure P is called a local variable of P. A variable 
y accessed by P and declared in a procedure Q calling P or used for the inputs or outputs of the 
algorithm A containing P is said to be global with respect to P. However, note that y would be a 
local variable of Q if declared in Q.

Definition 3.2. A BP computation $\pi$ is a limited access algorithm that is formed from the downpass
of a binary forking computation tree T followed by its up-pass, and satisfies the following properties.

i. A task that is not a leaf performs only $O(1)$ computation before it forks its two children in the
downpass of the computation.

ii. In the up-pass each task performs only $O(1)$ computation after the completion of its forked
subtasks.

iii. Each leaf node performs $O(1)$ computation.

iv. Each node declares at most $O(1)$ local variables.

v. $\pi$ may also use size $O(|T|)$ global arrays for its input and output.

vi. Balance Condition. Let the height of T be h; let the root task, which is at level 0 in T, have
size r; let $\alpha$ be a constant less than 1; and let $c_1$, $c_2$ be constants with $c_1 \le 1 \le c_2$. Then, the
size of any task $\tau$ at level i in T satisfies $c_1 \cdot \alpha^i \cdot r \le |\tau| \le c_2 \cdot \alpha^i \cdot r$.

The task head of a task $\tau$ is the computation it performs in part (i) in the above definition.

This definition of a BP computation requires sibling tasks to have essentially the same size to
within a constant factor. However, a computation in which these sizes are upper bounds on the
actual size is sufficient for our results, as long as this upper bound on the actual size is what is
used to compute the resource bounds. Also, note that any BP computation will have $\alpha \ge 1/2$; all
of our algorithms have $\alpha = 1/2$.
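To make the balance condition concrete, the following sketch lists the task sizes produced by repeated halving (as in M-Sum) level by level and checks them against the bounds $c_1 \alpha^i r \le |\tau| \le c_2 \alpha^i r$. The function names and the particular constants $c_1 = 1/2$, $c_2 = 2$ are choices made for this illustration; the definition only requires that some such constants exist.

```python
def level_sizes(n):
    """Task sizes of an M-Sum-style halving tree, grouped by level."""
    levels, current = [], [n]
    while current:
        levels.append(current)
        nxt = []
        for size in current:
            if size > 1:                       # size-1 tasks are leaves and do not fork
                nxt += [size // 2, size - size // 2]
        current = nxt
    return levels


def is_balanced(n, alpha=0.5, c1=0.5, c2=2.0):
    """Check c1 * alpha^i * n <= size <= c2 * alpha^i * n at every level i."""
    return all(c1 * alpha**i * n <= size <= c2 * alpha**i * n
               for i, sizes in enumerate(level_sizes(n))
               for size in sizes)
```

For example, `level_sizes(5)` is `[[5], [2, 3], [1, 1, 1, 2], [1, 1]]`, and `is_balanced(5)` holds with the default constants, whereas tightening `c2` to 1 fails because of the size-3 task at level 1.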

Later, in Section 4.7, in order to reduce the block wait costs in our scheduler implementation,
we will employ a variant of BP computations which we call padded BP computations.

Definition 3.3. A padded BP computation is a BP computation in which each node v in the
down-pass declares an array: let v correspond to the start of a subtask $\tau$; then v's array is of size

These arrays are present to ensure that the space used by successive nodes to store their variables 
(other than the new array) are well separated; this is what enables a reduction in block wait costs. 

In Section 1321 we will use the following observation on the nature of stolen tasks in a BP compu- 
tation under work-stealing. (This observation holds more generally for series-parallel computation 
dags.) 
Observation 3.1. Let D be the computation dag for a BP computation $\Pi$ executing at a core C
under work stealing. Let v be the node in D corresponding to the last task $\tau_v$ that was stolen from
C while it was executing $\Pi$, and let P be the path in D from the root of D to the parent of v.
Then, the set of tasks stolen from C during its execution of $\Pi$ consists of some or all of the tasks
corresponding to those nodes of D that are the right child of a node in P but are not themselves on
P. Further, they are stolen in top-down order with respect to the path P.
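The observation can be seen in a toy simulation of a single core's task queue. The sketch below is illustrative only: it assumes a complete binary fork tree, names each node by its root-to-node path of `L`/`R` moves, ignores the up-pass, and takes the steal times as an input (`steal_steps`); none of these names come from the paper.

```python
from collections import deque


def run_core(depth, steal_steps):
    """Simulate one core executing a complete binary fork tree of the given
    depth and return the tasks stolen from it, in the order of the steals."""
    queue = deque()      # left end = top (steal side), right end = bottom (owner side)
    stolen = []
    current, step = "", 0
    while True:
        if step in steal_steps and queue:
            stolen.append(queue.popleft())   # a thief takes the top task
        if len(current) < depth:
            queue.append(current + "R")      # fork: the right child goes to the bottom
            current += "L"                   # the owner continues with the left child
        elif queue:
            current = queue.pop()            # leaf finished: resume the bottom task
        else:
            break                            # queue empty: the core would now try to steal
        step += 1
    return stolen
```

With `run_core(4, {1, 3, 6, 10})` the stolen tasks are `["R", "LR"]`: both are right children of nodes on the path from the root to `L`, the parent of the last stolen task, and they are taken in top-down order.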

3.1 HBP Computations 

We now define the class of HBP Computations. 

Definition 3.4. A Hierarchical Balanced Parallel (HBP) computation is a limited access
algorithm that is one of the following:

1. A Type 0 Algorithm, a sequential computation of constant size.

2. A Type 1, or BP computation.

3. A Type $t + 1$ HBP, for $t \ge 1$. An algorithm is a Type $t + 1$ HBP if, on an input of size n, it
calls, in succession, a sequence of $c \ge 1$ collections of $v(n) \ge 1$ parallel recursive subproblems,
where each subproblem has size $s(n) \le n/b(n)$, with $b(n) > 1$; further, each of these collections
can be preceded and/or followed by calls to HBP algorithms of type at most t.

Data is transferred to and from the recursive subproblems by means of variables (arrays)
declared at the start of the calling procedure.

4. A Type $\max\{t_1, t_2\}$ HBP computation results if it is a sequence of two HBP algorithms of
types $t_1$ and $t_2$.

A Padded HBP computation is an HBP computation in which each BP subcomputation is
padded.

Definition 3.5. An HBP computation of type $t > 1$ is balanced if the recursive problems at each
level of recursion all have sizes within a constant factor of each other.

For convenience, we will assume this constant factor in balanced HBP computations to be $c_2/c_1$,
where $c_1$ and $c_2$ are the constants in Definition 3.2.

All HBP algorithms we consider here are balanced, but the sorting algorithm SPMS in our 
recent paper [12] is unbalanced. 

The HBP class is closely related to the Hierarchical Divide and Conquer (HD&C) class in [5]
(after the parallelism is exposed in the HD&C algorithms). The HD&C class was used in [5] for a
3-level cache hierarchy with a special scheduler that is not oblivious to cache parameters. The main
differences between HBP and HD&C are that we allow sequencing of HBP computations even at
the top level, and we do not restrict the number of subproblems that are called recursively to be
bounded by a constant; on the other hand, we restrict the computation to be limited access.

Forking recursive tasks. The recursive forking of $v(n)$ parallel tasks in an HBP computation
is incorporated into the binary forking in our multithreaded set-up by a BP-like tree of depth
$\log_2 v(n)$. All nodes at a given level have the same number of recursive subproblems, to within a
constant factor. Each leaf of this tree is a recursive subproblem. By its construction such a BP-like
tree will have $\alpha = 1/2$ in a balanced HBP.
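One way to picture this BP-like tree is to group the $v(n)$ subproblems by repeated halving, so that each internal node forks exactly two children. The sketch below is only an illustration; `fork_tree` and `depth` are hypothetical helper names, and subproblems are represented by strings.

```python
def fork_tree(subproblems):
    """Arrange subproblems as the leaves of a binary fork tree (nested pairs)."""
    if len(subproblems) == 1:
        return subproblems[0]                  # leaf: one recursive subproblem
    mid = len(subproblems) // 2
    return (fork_tree(subproblems[:mid]), fork_tree(subproblems[mid:]))


def depth(tree):
    """Number of binary forks on the longest root-to-leaf path."""
    return 0 if not isinstance(tree, tuple) else 1 + max(depth(t) for t in tree)
```

For the seven subproblems of a Strassen-style recursion, `depth(fork_tree([f"S{i}" for i in range(7)]))` is 3, that is, $\lceil \log_2 7 \rceil$, and sibling subtrees differ by at most one leaf.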
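| Algorithm | Type | $f(r)$ | $L(r)$ | $W(n)$ | $T_\infty$ | $Q(n, M, B)$ |
|---|---|---|---|---|---|---|
| Scans (MA, PS) | 1 | 1 | 1 | $O(n)$ | $O(\log n)$ | $O(n/B)$ |
| MT | 1 | 1 | 1 | $O(n^2)$ | $O(\log n)$ | $O(n^2/B)$ |
| Strassen | 2 | 1 | 1 | $O(n^\lambda)$ | $O(\log^2 n)$ | $n^\lambda/(B \cdot M^{\lambda/2 - 1})$ |
| RM to BI | 1 | $\sqrt{r}$ | 1 | $O(n^2)$ | $O(\log n)$ | $O(n^2/B)$ |
| Direct BI to RM | 1 | $\sqrt{r}$ | $\sqrt{r}$ | $O(n^2)$ | $O(\log n)$ | $O(n^2/B)$ |
| BI-RM (gap RM) | 1 | $\sqrt{r}$ | gap | $O(n^2)$ | $O(\log n)$ | $O(n^2/B)$ |
| BI-RM for FFT | 2 | $\sqrt{r}$ | 1 | $O(n^2 \log\log n)$ | $O(\log n)$ | $O(\frac{n^2}{B} \log_M n)$ |
| FFT | 2 | $\sqrt{r}$ | 1 | $O(n \log n)$ | $O(\log n \cdot \log\log n)$ | $O(\frac{n}{B} \log_M n)$ |
| LR | 3 | $\sqrt{r}$ | gap | $O(n \log n)$ | $O(\log^2 n \cdot \log\log n)$ | $O(\frac{n}{B} \log_M n)$ |
| CC | 4 | $\sqrt{r}$ | gap | $O(n \log^2 n)$ | $O(\log^3 n \cdot \log\log n)$ | $O(\frac{n}{B} \log_M n \cdot \log n)$ |
| Depth-n-MM [13] | 2 | 1 | 1 | $O(n^3)$ | $O(n)$ | $n^3/(B\sqrt{M})$ |
| Sort [12] | 2 | $\sqrt{r}$ | 1 | $O(n \log n)$ | $O(\log n \cdot \log\log n)$ | $O(\frac{n}{B} \log_M n)$ |

Table 1: Basic parameters of the HBP algorithms we analyze. Type refers to the HBP type, $f(r)$ is
the cache-friendliness function, and $L(r)$ is the block-sharing function. The bounds with $f(r) = \sqrt{r}$
assume a tall cache. The input size is $n$, except for matrix computations, where the input size is
$n^2$. For completeness, we include the known bounds for work ($W(n)$), critical path length ($T_\infty$),
and sequential cache complexity ($Q(n)$).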

3.2 HBP Algorithms 

Our results for PWS in Section 4 establish that $L(r) = O(1)$ is desirable, while $f(r) = O(\sqrt{r})$
suffices if we have a standard tall cache, $M \ge B^2$. Table 1 lists the HBP algorithms that we present
and analyze in this paper. Most of these algorithms are adapted from known HD&C algorithms.
All of them are limited access and have $f(r) = O(\sqrt{r})$; for Depth-n-MM, the original algorithm
in [17] is not limited access, but it is converted to being limited access by using local arrays for
copying in [13]. Many of these algorithms also inherently have $L(r) = O(1)$ (e.g., Scans, MT
(Matrix Transposition) and Strassen (Matrix Multiplication)), while others are modified through
the gapping technique to reduce the block miss cost.

Scans, including M-Sum seen earlier, and MA (Matrix Addition) [5] can be implemented
as a single BP computation. Prefix sums (PS) can be implemented as a sequence of two BP
computations, where the first BP computation computes sums of disjoint subarrays of size $2^i$, for
$i < \log n$, and the second BP computation computes the final output. These are type 1 HBP
computations with $f(r) = O(1)$, $L(r) = O(1)$.
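A minimal sequential sketch of the two-pass structure for prefix sums is given below. It is an illustration under stated assumptions rather than the paper's implementation: the names `prefix_sums`, `up` and `down` are hypothetical, the subarray sums are kept in a dictionary keyed by index range, and the forked calls are run one after the other.

```python
def prefix_sums(a):
    """Inclusive prefix sums via an up-pass over subarray sums and a down-pass."""
    n = len(a)
    sums = {}                                  # (lo, hi) -> sum of a[lo:hi], written once

    def up(lo, hi):                            # first pass: sums of disjoint subarrays
        if hi - lo == 1:
            sums[lo, hi] = a[lo]
        else:
            mid = lo + (hi - lo) // 2
            sums[lo, hi] = up(lo, mid) + up(mid, hi)
        return sums[lo, hi]

    out = [None] * n

    def down(lo, hi, offset):                  # second pass: offset = sum of a[0:lo]
        if hi - lo == 1:
            out[lo] = offset + a[lo]           # each output entry is written exactly once
        else:
            mid = lo + (hi - lo) // 2
            down(lo, mid, offset)
            down(mid, hi, offset + sums[lo, mid])

    if n:
        up(0, n)
        down(0, n, 0)
    return out
```

Every variable is written a constant number of times, in keeping with the limited access requirement, and each task writes a contiguous range of the output.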

Matrix Computations. For matrix computations, we assume that the matrix is in the bit
interleaved (BI) layout, which recursively places the elements in the top-left quadrant, followed by
recursively placing the top-right, bottom-left, and bottom-right quadrants. The advantage of the
BI layout is that it results in BP tasks that are $O(1)$-friendly, and have $O(1)$-block sharing, which
allows us to obtain good cache and block miss bounds. We describe several methods to convert
between the standard row major (RM) layout and BI; these methods can be used in conjunction
with our algorithms for BI if the input and output matrices are to be in RM.
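The quadrant order described above corresponds to interleaving the bits of the row and column indices, with the row bit placed above the column bit at each level. The following sketch computes the BI position of an entry; the function name `bi_index` and the restriction to $2^{bits} \times 2^{bits}$ matrices are assumptions of the example.

```python
def bi_index(row, col, bits):
    """Position of entry (row, col) of a 2^bits x 2^bits matrix in BI order."""
    index = 0
    for b in reversed(range(bits)):            # most significant bit first
        row_bit = (row >> b) & 1
        col_bit = (col >> b) & 1
        index = (index << 2) | (row_bit << 1) | col_bit
    return index
```

For a 4 x 4 matrix, `bi_index(2, 3, 2)` is 13: the entry lies in the bottom-right quadrant (offset 12) and is the second entry (offset 1) within that quadrant.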
MT is matrix transposition when the $n \times n$ matrix is given in the BI layout. When we expose
the parallelism in the recursive algorithm in p2| we obtain a BP computation with $f(r) = O(1)$
and $L(r) = O(1)$.

Strassen. We expose the parallelism in Strassen's matrix multiplication algorithm that
multiplies two $n \times n$ matrices by recursively multiplying seven $n/2 \times n/2$ matrices, and performs the
matrix additions for the divide and combine steps using MA. This results in an HBP computation
that is of type 2, with $c = 1$ collection of $v = 7$ subproblems of size $s(m) = m/4$, where $m = n^2$
is the size of the matrix. This algorithm computes the 7 recursive submatrices in new subarrays.
These matrices are then combined with matrix additions and subtractions (performed using MA)
according to Strassen's algorithm, and the final four submatrices are written back to the four
quadrants in the parent matrix. Thus each variable in this algorithm is written only a constant number
of times, and the algorithm is inherently limited access. When the matrices are in the BI layout,
this computation has $f(r) = O(1)$ and $L(r) = O(1)$. The sequential cache complexity is
$O(n^\lambda/(B \cdot M^\gamma))$,^2 where $\lambda = \log_2 7$ and $\gamma = (\lambda/2) - 1$.
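For reference, a compact sequential sketch of the recursion is shown below. It is a plain textbook version on lists of lists, assuming square matrices whose side is a power of two; it does not use the BI layout or any forking, and the helper names (`add`, `sub`, `split`, `strassen`) are chosen for this example. Each of the seven products is stored in a fresh array, and each result quadrant is written once.

```python
def add(X, Y):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def sub(X, Y):
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def split(X):
    """Return the quadrants X11, X12, X21, X22 of a square matrix."""
    h = len(X) // 2
    return ([row[:h] for row in X[:h]], [row[h:] for row in X[:h]],
            [row[:h] for row in X[h:]], [row[h:] for row in X[h:]])


def strassen(A, B):
    """Multiply two n x n matrices (n a power of two) with seven recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # The seven recursive subproblems, each written to a new array.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Combine step: matrix additions and subtractions only.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r + s for r, s in zip(C11, C12)]
    bottom = [r + s for r, s in zip(C21, C22)]
    return top + bottom
```

In the HBP version, the seven calls form the single collection of parallel recursive subproblems, and the additions before and after them are the type 1 (MA) computations.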

Since we have assumed in the above algorithms that matrices are in the BI layout, we need
methods to convert between the traditional RM (row major) layout and the BI layout. It turns out
that RM to BI is easy to execute with $O(1)$ block-sharing, while BI to RM requires more effort.

RM to BI. We use a simple BP computation that recursively converts each quadrant in parallel,
with all writes in BI order. The writes are thereby arranged so that tasks share $L(r) = O(1)$ blocks
for writing. Reading, however, is only $f(r) = \sqrt{r}$-friendly. This is a BP computation, so it is a
type 1 HBP.
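A sequential sketch of this conversion is given below, under the assumption of a $2^k \times 2^k$ input stored as a list of rows; `rm_to_bi` and `convert` are hypothetical names. Each recursive call writes one contiguous range of the output, which is why adjacent tasks can share at most the blocks at the two ends of their ranges, while the reads of a call are spread over `side` separate rows of the input.

```python
def rm_to_bi(matrix):
    """Convert a 2^k x 2^k matrix (list of rows) into a flat list in BI order."""
    n = len(matrix)
    out = [None] * (n * n)

    def convert(row0, col0, side, offset):
        # Writes the side x side submatrix at (row0, col0) to out[offset : offset + side^2].
        if side == 1:
            out[offset] = matrix[row0][col0]
            return
        h = side // 2
        q = h * h                              # number of entries in one quadrant
        convert(row0, col0, h, offset)                  # top-left
        convert(row0, col0 + h, h, offset + q)          # top-right
        convert(row0 + h, col0, h, offset + 2 * q)      # bottom-left
        convert(row0 + h, col0 + h, h, offset + 3 * q)  # bottom-right

    convert(0, 0, n, 0)
    return out
```

In the BP computation the four calls would be forked in parallel; here they run in sequence.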

By employing RM to BI initially and suitable versions of BI to RM conversion at the end 
(described below), we obtain algorithms RM-MT (use BI-RM (gap RM)), and RM-Strassen (use 
BI-RM for FFT). We now describe several different methods for converting from BI to RM. 

Direct BI to RM. This simple method uses the same recursion as the direct RM to BI method
mentioned above. However, since the writes are to an output matrix in RM, both $L(r)$ and $f(r)$
are $\sqrt{r}$.

We now present two improved algorithms for this (with respect to block misses), of which only
the first method performs $O(n^2)$ work.

1. BI-RM (gap RM). This is an $O(\log n)$ parallel running time, $O(n^2)$ operation algorithm.

This is the same as Direct BI to RM, but to mitigate the block miss cost, we use a gapping
technique. The destination array representing the RM matrix will be given gaps as follows: between
$r \times r$ subarrays (for values of r corresponding to recursive subproblems) the rows will be given a
length $r/\log^2 r$ gap. Now, tasks of size $r^2$ for $r = \Omega(B \log^2 B)$ share zero blocks for their writing.
This gives a cost of $O(Br)$ for the block misses for a size $r^2$ task, for $r = O(B \log^2 B)$. So
$L(r^2) = O(r)$, but only for $r < B \log^2 B$.

The justification for this choice of size is that it only increases the size of the array by a constant
multiplicative factor (for $\sum_{k \ge 2} 1/k^2 = O(1)$).

Indeed a gap of $r/[\log r (\log\log r)^2]$, or any analogous sequence of iterates, also works, reducing
the block miss cost correspondingly.
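The constant-factor claim can be checked numerically. The sketch below makes a simplifying assumption about the layout, which the text does not spell out: a row of a $2^k \times 2^k$ subarray consists of two half-rows plus one gap of $\lfloor r/\log^2 r \rfloor$ words with $r = 2^k$, for $k \ge 2$. Under this assumed rule (and with the hypothetical name `padded_row_length`), the padded row length divided by $2^k$ grows by at most $1/k^2$ per level.

```python
def padded_row_length(k):
    """Row length of a 2^k x 2^k RM subarray after gaps are inserted.

    Assumed rule for illustration: two half-rows plus one gap of
    floor(r / log^2 r) words, where r = 2^k, applied for k >= 2.
    """
    if k <= 1:
        return 2 ** k
    r = 2 ** k
    return 2 * padded_row_length(k - 1) + r // (k * k)
```

Since $1 + \sum_{k \ge 2} 1/k^2 = \pi^2/6 < 1.645$, the ratio `padded_row_length(k) / 2**k` stays below 1.645 for every k, so the gapped array is larger than the original by only a constant factor.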

Having written to an array with gaps, one needs to compress the array using a standard scan.
This is a BP computation which has $f(r) = O(1)$ and $L(r) = O(1)$.

^2 We correct a typo in the cache bound for Strassen found in many papers, starting with [17].
2. BI-RM for FFT. This is an $O(\log n)$ parallel running time, $O(n^2 \log\log n)$ operation
algorithm. The algorithm divides the input BI array of length $n^2$ into n subproblems, each of which
it recursively converts to the RM order. Then, using a BP computation, it copies the n subarrays
into one subarray, accessing data according to the RM order in the target output. This is a
type 2 HBP computation that calls $c = 1$ collection of $v(n^2) = n$ subproblems of size $s(n^2) = n$.
The BP computation for the copying is organized so that the writes are in RM order, and hence
$L(r) = O(1)$.

We now show that $f(r) = O(\sqrt{r})$, assuming $M \ge B^2$. Consider a size r task $\tau$ performing a
portion of the computation on a $k \times k$ subproblem. The input to this $k \times k$ subproblem consists of
$\sqrt{k}$ rows of $\sqrt{k}$ submatrices that are $\sqrt{k} \times \sqrt{k}$. Each of these $\sqrt{k} \times \sqrt{k}$ matrices has already been
converted to RM by recursive calls.

Let $r = s \cdot k + s' \cdot \sqrt{k} + s''$, where $0 \le s', s'' < \sqrt{k}$. We consider here the case when $s \ge 1$; the
case when $s = 0$ is handled similarly.

The task $\tau$ reads in s full rows of the output $k \times k$ matrix, plus part of the $(s + 1)$st row.

If $B \ge \sqrt{k}$ then by the tall cache assumption, $M \ge B^2 \ge k$. Hence, when the data is read row
by row according to the output $k \times k$ matrix, one block will be read from each of the $\sqrt{k} \times \sqrt{k}$
matrices in a given row. Further, under LRU, the $\sqrt{k}$ blocks will all be in cache when the data for
the next row of the output matrix is read, and hence, within each $\sqrt{k} \times \sqrt{k}$ matrix, the number
of blocks read is the scan bound, leading to a bound of $O(r/B + \sqrt{k})$ cache misses for this
computation. Hence the number of cache misses is $O(r/B + \sqrt{r})$.

If $B < \sqrt{k}$ then reading a single row in one of the $\sqrt{k} \times \sqrt{k}$ matrices will incur $O(\sqrt{k}/B)$ cache
misses, and hence the total number of cache misses is $O(r/B)$.

FFT. We expose the parallelism in the six-step variant of the FFT algorithm [H [21] which 
is shown to have optimal $Q(n, M, B)$ in [17]. As noted in [11], this is also a low-depth multicore
algorithm. The algorithm views the input as a square matrix, which it transposes, then performs a
sequence of two recursive FFT computations on independent parallel subproblems of size $\Theta(\sqrt{n})$,
and finally performs MT on the result. The sequential time is $O(n \log n)$ and the sequential cache
complexity is $O(\frac{n}{B} \log_M n)$ [17], and the parallel depth is readily seen to be $O(\log n \cdot \log\log n)$.

We keep the matrices in the BI representation. Thus, the HBP algorithm FFT, when called
on an input of length $n$, makes a sequence of $c = 2$ calls to FFT on $v(n) = \sqrt{n}$ subproblems of
size $s(n) = \sqrt{n}$ with a constant number of BP computations (mainly MT) performed before and
after each recursive call. We have $f(r) = O(1)$ and $L(r) = O(1)$, outside of the cost to convert
between BI and RM formats. At the end, to convert to the RM format we use BI-RM for FFT.
Thus $f(r) = \sqrt{r}$ for the overall FFT algorithm due to the need to use RM to BI and BI-RM for
FFT.
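To make the transpose / recursive FFT / transpose structure concrete, here is a minimal sequential sketch of the textbook six-step FFT on a row-major list. It is only an illustration under stated assumptions: the input length is a perfect square at every level of recursion (other lengths fall back to a naive DFT), the twiddle-factor step is the standard one for this decomposition, and the BI layout, the parallel forking and the names `dft`, `transpose` and `six_step_fft` are not taken from the paper.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) DFT, used as the base case and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def six_step_fft(x):
    n = len(x)
    m = math.isqrt(n)
    if n <= 4 or m * m != n:
        return dft(x)
    a = [x[i * m:(i + 1) * m] for i in range(m)]   # view input as an m x m matrix
    t = transpose(a)                               # step 1: transpose
    y = [six_step_fft(row) for row in t]           # step 2: m independent FFTs of size m
    y = [[y[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n)
          for k1 in range(m)] for j2 in range(m)]  # step 3: twiddle factors
    z = transpose(y)                               # step 4: transpose
    w = [six_step_fft(row) for row in z]           # step 5: m independent FFTs of size m
    out = transpose(w)                             # step 6: final transpose
    return [v for row in out for v in row]
```

The two list comprehensions that call `six_step_fft` correspond to the two collections of independent subproblems of size $\sqrt{n}$; in a multithreaded setting each collection could be forked in parallel.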

List Ranking (LR). The list ranking problem is known to require $\min\{n, \frac{n}{B}\log_M n\}$ cache
misses even in a sequential computation. We match the second term, since it determines the bound
for the normal range of parameter values. Our algorithm for LR uses the resource oblivious sorting
algorithm SPMS in our recent paper [12], whose bounds are the same as those for FFT. The Euler
tour and tree computation algorithms have the same complexity as LR.

Efficient multicore algorithms for LR based on eliminating large independent sets are given in
[3, 11, 6]. As in [6] we adapt the PRAM algorithm that performs $O(\log\log n)$ stages of eliminating a
constant fraction of the elements in the linked list, and then switches to the basic pointer jumping
algorithm when the size of the linked list falls below $n/\log n$. To find a large independent set,
we use the method MO-IS in [11] that constructs an $O(\log^{(k)} r)$-size coloring of the linked list,
and then extracts an independent set of size at least $r/3$ (where $r$ is the current length of the
linked list) by examining elements of each color class in turn. A phase on a list of length $r$
performs $O(\log^{(k)} r)$ calls to SPMS on inputs whose combined length is $r$. Thus it can be shown
[11] that a phase on a list of length $r$ completes with $O(\frac{r}{B}\log_M r)$ cache misses sequentially, and in
$O(\log r \cdot \log\log r \cdot \log^{(k)} r)$ parallel time when using SPMS for sorting, since SPMS has parallel time
complexity $O(\log r \cdot \log\log r)$. Once the algorithm switches to pointer jumping, each pointer jumping
stage can be performed with $O(1)$ calls to SPMS and scans on a list of length $O(n/\log n)$, hence
overall this incurs $O(n \log n)$ work and has $O((n/B)\log n)$ cache misses and $O(\log^2 n \log\log n)$
parallel time. This pointer jumping phase of the algorithm gives rise to the dominant cost for the
basic parameters of LR listed in Table 1.
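For readers unfamiliar with the basic pointer jumping step, the following sketch shows it in its simplest sequential form. It is an assumption-laden illustration: the list is given as a successor array `succ` (with `None` marking the last element), each round is simulated with plain loops rather than with calls to a sorting routine such as SPMS, and the function name is hypothetical.

```python
def pointer_jumping_ranks(succ):
    """Return rank[i] = number of links from element i to the end of the list."""
    n = len(succ)
    nxt = list(succ)
    rank = [0 if s is None else 1 for s in succ]
    # O(log n) rounds suffice because each round doubles the distance jumped.
    for _ in range(max(1, n.bit_length())):
        new_rank = list(rank)
        new_nxt = list(nxt)
        for i in range(n):          # conceptually, all i are updated in parallel
            if nxt[i] is not None:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        rank, nxt = new_rank, new_nxt
    return rank
```

For the list 2 -> 0 -> 3 -> 1, encoded as `succ = [3, None, 0, 1]`, the ranks are `[2, 0, 3, 1]`.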

Since LR makes calls to the type 2 HBP sorting algorithm SPMS before calling itself recursively
on $c = 1$ sequence of $v(n) = 1$ subproblem of size $s(n) \le 2n/3$, this is a type 3 HBP algorithm.

To reduce the number of block misses in the recursive calls, we introduce gaps between the
elements of the contracted linked list as follows: when the list has size $n/x^2$, it is written in space
$n/x$, using every $x$th location only. Thus, when the list has size $n/B^2$ or less, no more block misses
occur. Note that this modification of the list ranking algorithm does not affect the cache miss cost
beyond a constant factor. This is because each of the BP computations from which the sorting
algorithm SPMS is built has a cache miss cost no larger than the cost on an array of equal length
with all entries occupied. Hence this holds for SPMS as well, and hence for each recursive call in
the list ranking algorithm. As the space used is still shrinking by a factor of at least 2 every 4
iterations, the same geometrically shrinking costs occur, leading to the same asymptotic bounds.
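The gapping rule can be checked on small numbers. The sketch below places a list of size $n/x^2$ in space $n/x$ using every $x$th location, and counts how many list elements end up in the same block. The helper names and the assumption that block boundaries fall at multiples of $B$ are illustrative choices, not part of the paper.

```python
def gapped_positions(n, x):
    """Array positions used when a list of size n // x**2 is written with gap x."""
    size = n // (x * x)
    return [i * x for i in range(size)]


def max_elements_per_block(positions, B):
    """Largest number of occupied positions that fall within one block of size B."""
    counts = {}
    for pos in positions:
        counts[pos // B] = counts.get(pos // B, 0) + 1
    return max(counts.values(), default=0)
```

With `n = 1024` and `B = 4`, a gap of `x = 2` still leaves two elements per block, while a gap of `x = 4` (list size $n/B^2 = 64$) leaves one element per block, so distinct elements no longer share a block.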

CC. We use the connected components algorithm in [11]. The dominant cost in this algorithm 
is log n stages of list ranking, so our resource-oblivious implementation increases each of the work, 
parallel time, cache complexity, and block miss cost by a factor of log n. 

3.3 Data Layout and Block Wait Costs 

By definition, an HBP algorithm is a limited-access computation. Additionally, all of the HBP
algorithms we have considered are either $O(1)$-block sharing, or incorporate gapping. These
features are useful in reducing block wait costs. However, space gets reused on the execution stacks
of the cores, and this may cause block misses beyond those that can be inferred by analyzing
accesses to variables in the algorithm.

Execution Stacks. Each core $C$, when it starts executing a task $\tau$, will create an execution stack
$S_\tau$ to keep track of the procedure calls and variables in the work it performs on $\tau$. The variables
on $S_\tau$ may be accessed by stolen subtasks also. As $S_\tau$ grows and shrinks it may use and then stop
using a block $\beta$ repeatedly. Thus, it could be the case that a computation with limited-access and
$O(1)$-block sharing still incurs a large block wait cost due to accesses to the execution stacks.

Lower and upper bounds on the space used for the variables declared in an HBP procedure
affect the block wait as shown in our companion paper [13]. To obtain our bounds, we need that
the space used by the local variables of a Type 2 HBP procedure dominate the logarithmic space
used by its BP subtasks. For simplicity in our presentation, we make this a requirement that linear
space be used by these local variables. All of our algorithms satisfy this property.

Accordingly, we make the following definition.

Definition 3.6. A Type 2 HBP algorithm is Exactly Linear Space Bounded if the variables declared
by each Type 2 task $\tau$ use space $\Theta(|\tau|)$.

In our companion paper [13] we prove the following lemma.

Lemma 3.1. (i) Let $A$ be a limited-access BP algorithm and let $\tau$ be either the original task in
the computation of $A$ or a task which is stolen during the execution of $A$. Let $\beta$ be a block used for
$\tau$'s execution stack $S_\tau$. Then $\beta$ incurs a block delay of $O(\min\{B, \log|\tau|\})$ during $\tau$'s execution.

(ii) Let $A$ be a limited-access, exactly linear space bounded Type 2 HBP algorithm, for which
each collection of recursive calls consists of subproblems of size at most $s(n) \le (1 - \gamma)n$, for some
constant $0 < \gamma < 1$. Let $s^{(i)}(n)$ be the function $s$ iterated $i$ times. Let $\tau$ be either the original task
in the computation of $A$ or a task which is stolen during the execution of $A$. Let $\beta$ be a block used
for $\tau$'s execution stack $S_\tau$. Then the number of transfers of block $\beta$ during the execution of $\tau$ is
bounded by



If $s(n) \le (1 - \gamma)n/c$ this is an $O(\min\{cB, |\tau|\})$ bound.

All of our HBP algorithms satisfy the requirements of the above lemma. 

Data Layout in a BP Computation. The computation at a node $u$ during the up-pass in a BP
computation involves updating $O(1)$ data on its execution stack, spread across at most $c = O(1)$
blocks, and possibly $O(1)$ updates to the output data. In the case of output data we assume that
the data for each size $r$ BP computation is in an array of size $O(r)$, and is stored according to an
in-order traversal of its up-tree. So, for instance, in M-Sum, the output data at each node in the
up-tree is the value of the sum of the input values at the leaves of the node's subtree, and these
values are stored in the order of an in-order traversal of the up-tree. The advantage of this layout
is that it will result in no block misses for accessing output data at levels in the BP tree where the
number of leaves in each subtree is greater than $B$.
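The in-order layout can be illustrated on a complete binary up-tree. In the sketch below, which assumes the number of leaves is a power of two and uses a hypothetical function name, leaf $i$ is stored at position $2i$ and the node covering leaves `[lo, hi)` is stored between its two subtrees, so every subtree occupies a contiguous range of the output array.

```python
def inorder_sums(values):
    """Store the subtree sums of a complete binary tree in in-order layout.

    len(values) is assumed to be a power of two. The result has 2n - 1 entries.
    """
    n = len(values)
    out = [0] * (2 * n - 1)

    def fill(lo, hi):
        # Leaves lo..hi-1 occupy positions 2*lo .. 2*hi - 2 of `out`.
        if hi - lo == 1:
            out[2 * lo] = values[lo]
            return values[lo]
        mid = (lo + hi) // 2
        total = fill(lo, mid) + fill(mid, hi)
        out[2 * mid - 1] = total  # the node sits between its two subtrees
        return total

    fill(0, n)
    return out
```

For the input `[1, 2, 3, 4]` the layout is `[1, 3, 2, 10, 3, 7, 4]`: the root sum 10 sits in the middle, with each half of the array holding one of its subtrees.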



4 Priority Work Stealing Scheduler (PWS)

In this section we present PWS, a deterministic work stealing scheduler. We show that the excess
cache misses over the sequential cache miss cost in a BP computation when scheduled under PWS
is $O(pM/B)$ with a standard tall cache $M \ge B^2$ if $f(r) = O(\sqrt{r})$. For block misses, we bound
the block wait cost for a size $n$ BP computation with $L(r) = O(1)$ by $O(Bp\log B)$. We build on
these results to obtain bounds for our HBP algorithms. In our analysis, we also bound other costs,
including usurpation costs and idle time. In Section 4.7 we give a distributed implementation of
PWS.

PWS assigns to each task a priority that decreases with increasing depth in the computation
dag, so there are up to $T_\infty$ different priorities. Steals in PWS proceed in rounds, one for each
priority, and are performed in non-increasing priority order.

When a core $C$ executes a task $\tau$, the tasks it places on its task queue will have lower priority
(i.e., larger computation depth) than $\tau$, hence it is not difficult to see that at most $p - 1$ tasks of
any given priority are stolen under PWS. Further, the priorities are assigned so that all tasks with a
given priority have the same size, to within a constant factor; note that this is readily accomplished
for BP and balanced HBP computations.

In this section we bound the various costs incurred by a balanced HBP computation when
scheduled under PWS. We establish that the main costs arise from the cache and block misses,
which we bound in terms of their excess, which is the amount by which these costs exceed the
sequential cache complexity of the computations. The following two lemmas give the bounds we
derive for the cache and block miss excess for the Type 2 HBP algorithms we consider. We analyze
LR (Type 3) and CC (Type 4) separately, building on these results. These lemmas are established
in Sections 4.2 and 4.3 respectively.

Given a function $s(n) < n$, recall that $s^*(n)$ is the minimum $i$ such that $s^{(i)}(n) \le c$, for a
suitable constant $c$; here, $s^{(0)}(n) = n$, and $s^{(i)}(n) = s(s^{(i-1)}(n))$, if $i > 0$. We also define $s^*(n, M)$
as the minimum $i$ such that $s^{(i)}(n) \le M$.
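The two iterated-function quantities are easy to compute directly from their definitions. The sketch below does so for an arbitrary shrinking function; the function name `iterations_until` and the choice of the constant `c = 2` in the usage examples are assumptions for illustration.

```python
import math


def iterations_until(n, s, bound):
    """Minimum i such that s applied i times to n is at most `bound`.

    With bound = c this is s*(n); with bound = M it is s*(n, M).
    s is assumed to satisfy s(m) < m for every m > bound.
    """
    i = 0
    while n > bound:
        n = s(n)
        i += 1
    return i
```

For example, with $s(n) = \sqrt{n}$ and $n = 65536$, `iterations_until(65536, math.isqrt, 2)` is 4 and `iterations_until(65536, math.isqrt, 16)` is 2; with $s(n) = n/4$ and $n = 1024$, the value for bound 16 is 3.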

Lemma 4.1. Let $\Pi$ be a balanced Type 2 HBP computation of size $n \ge Mp$, and let $c$, $s(n)$, and
$f(r)$ be as defined earlier. Then, the cache miss excess for $\Pi$ when scheduled under PWS has the
following bounds with a tall cache $M \ge B^2$.

(i) If $c = 1$, $f(r) = O(\sqrt{r})$: $O(p \frac{M}{B} s^*(n, M))$.

(ii) If $c = 2$, $f(r) = O(\sqrt{r})$, and $s(n) = \sqrt{n}$: $O(p \frac{M}{B} \frac{\log n}{\log M})$.

(iii) If $c = 2$, $f(r) = O(\sqrt{r})$, and $s(n) = n/4$: $O(p[\frac{\sqrt{nM}}{B} + \sqrt{\frac{n}{M}} \sum_{i \ge 0} 2^i f(M/4^i)])$.

Lemma 4.2. Let $\Pi$ be a balanced Type 2 HBP computation of size $n \ge Mp$ with $\alpha = 1/2$ and
$L(r) = O(1)$, which is exactly linear space bounded, and let $c$, $s(n)$, and $L(r)$ be as defined earlier.
Then, the block miss excess for $\Pi$ when scheduled under PWS has the following bounds.

(i) $c = 1$: a cost of $O(pB \log B \cdot s^*(n))$ cache misses.

(ii) $c = 2$ and $s(n) = \sqrt{n}$: a cost of $O(pB \log n \log\log B)$ cache misses.

(iii) $c = 2$ and $s(n) = n/4$: a cost of $O(pB\sqrt{n})$ cache misses.

4.1 PWS Scheduling 

As mentioned earlier, PWS scheduling requires tasks to have integer priorities with the property
that on any root to leaf path in the computation tree $T$ the priorities are strictly decreasing. PWS
proceeds in rounds, one for each integer priority. The steals under PWS are performed in decreasing
priority order, which, loosely speaking, is also a size-based breadth-first search order (since the
priorities will be a function of the sizes).

The first round starts when the core $C$ that began computation $\Pi$ places its first task $\tau_1$ on
its task queue. Let this task have priority $d_1$. The priority of round 1 is $d_1$, and during this first
round, any of the other cores can steal $\tau_1$. Round 1 concludes when $\tau_1$ has been removed from the
head of $C$'s task queue and assigned to an idle core $C'$.

In general, a round with priority $d$ concludes when no task queue has a task of priority $d$ at its
head and every non-idle core has generated a task on its task queue. This starts the next round
whose priority $d'$ is the priority of the highest-priority task at the head of a task queue. During
this round, tasks of priority $d'$ at the heads of task queues are stolen by idle cores until no task at
the head of a task queue has priority $d'$.
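A single steal round can be sketched as follows. This is a simplified model, not the paper's implementation: each task queue is a deque of integer priorities with the head at index 0, queues are assumed to hold strictly decreasing priorities from head to bottom, idle cores are given as a list of core ids, and the condition that every non-idle core has already generated a task is taken for granted. The function name `run_round` is hypothetical.

```python
from collections import deque


def run_round(queues, idle_cores):
    """Perform one PWS-style round; return (round_priority, steals).

    steals is a list of (thief, victim, priority) triples. Both `queues`
    and `idle_cores` are modified in place.
    """
    heads = [q[0] for q in queues if q]
    if not heads:
        return None, []
    d = max(heads)  # priority of this round: highest priority at any queue head
    steals = []
    for victim, q in enumerate(queues):
        if not idle_cores:
            break
        if q and q[0] == d:
            thief = idle_cores.pop(0)
            steals.append((thief, victim, q.popleft()))
    return d, steals
```

Because priorities within a queue strictly decrease from the head, a queue whose head is stolen cannot expose another task of the same priority, so one pass over the queues completes the round when enough idle cores are available.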

Observation 4.1. The priorities of tasks in the task queue of a core at any point in time are
strictly decreasing from top (i.e., the head) to bottom.

Observation 4.2. If a steal request by a core is unsuccessful in a round with priority $d$, then any
remaining task has smaller priority.

Observation 4.3. For each task priority, there are at most $p - 1$ tasks of that priority that are
stolen.

Corollary 4.1. Let the number of distinct task priorities be $D'$. The total number of steal attempts
(including both successful and unsuccessful steals) across all cores is at most $2 \cdot p \cdot D'$.

4.2 Cache Misses under PWS 

In a BP computation, the priority of a node at depth i is i, and priorities decrease with increasing 
depth. By the definition of a BP computation, all tasks with a given priority have the same size, 
to within a constant factor. 

We start with the following useful lemma, which uses $c_1$, $c_2$ from the definition of a BP
computation.

Lemma 4.3. Consider the downpass of an $f$-cache friendly BP computation $\Pi$ scheduled under
PWS on $p$ cores, each with a cache of size $M$. Let $d$ be the number of distinct priorities in $\Pi$, and
$Q$ its sequential cache complexity.

If $f(r) = O(r/B)$ for $r \ge (2c_1/c_2) \cdot M$, then the number of additional cache misses incurred by
all stolen tasks of size greater than $2M$ is $O(Q)$.

Proof. By Lemma 2.1 and since $c_1 \le c_2$, a stolen task of size greater than $2M$ has zero cache miss
excess. What remains to be argued is that for each task $\tau$ of size greater than $2M$, the additional
cache miss cost of the task kernel of $\tau$ that remains after tasks of size $2M$ or larger are stolen at a core
can also be bounded. For this, consider a task $\tau$ (either the root task or a stolen task) executing
at a core $C$. If there are no steals then the result holds vacuously. Otherwise let $\tau_v$ be the last task
of size $2M$ or larger stolen from $C$ while executing $\tau$, and let $v$ be the node in the computation
tree for $\tau$ corresponding to $\tau_v$. Then, by Observation 3.1, the task at the sibling of $v$ is not a
stolen task and is part of the task kernel of $\tau$. But this sibling task has size at least
$(c_1/c_2) \cdot |\tau_v| \ge (2c_1/c_2) \cdot M$.
Further, the kernel consists of a sequence of maximal sub-tasks in $\tau$, all of which have size at least
$(2c_1/c_2) \cdot M$ (since this sibling task is the last and smallest of these tasks). Hence since
$f(r) = O(r/B)$ for $r \ge (2c_1/c_2) \cdot M$,
the sum of the costs $f(r_i)$, where $r_i$ is the size of the $i$th maximal task in the kernel, is bounded by
the kernel's sequential cache complexity.

The $O(1)$ cache miss cost at each task-head at the parent of a stolen task can be bounded as
follows. Consider the tree formed by the nodes for stolen tasks of size $2M$ or larger, the ancestors of
these nodes, and all siblings of such nodes and their ancestors. This tree is defined with respect to
the tasks computed at all the cores. This is a binary tree in which each non-leaf has two children,
and each leaf has size at least $(2c_1/c_2) \cdot M$. Thus each of these leaves incurs $\Omega(M/B)$ cache miss
cost. It follows that each leaf can safely be allocated a further $O(1)$ cache misses. As there are more
leaves than internal nodes, the $O(1)$ cache-miss cost for each internal node can be allocated
to a distinct leaf; this includes the nodes for each task-head at the parent of a stolen task.

Hence by the observation that any sequential execution of a task of size $r \ge 2M$ will incur
at least $r/B$ cache misses, we can bound the cache miss cost of executing the task kernels by the
sequential cache complexity $O(Q)$. ■

The above lemma bounds the additional cache misses incurred when a task is stolen. We note
that there is another source for cache misses incurred due to steals, and this occurs when control
of a task transfers to a stolen task after a join in the computation. This motivates the following
definition of usurpation.

Definition 4.1. Let $\tau$ be a task whose kernel is being executed by core $C$. Let $C$ complete its
execution of a subtask $\tau''$ in $\tau$'s kernel, and suppose that the join is performed after $\tau''$ by its sibling
task $\tau'$, which is a stolen subtask executed by another core $C'$. Then, $C'$ will take over the remainder
of the execution of $\tau$'s kernel. This change in the core executing $\tau$'s kernel is called a usurpation
of $\tau$ by $C'$.

A usurpation of $\tau$'s kernel by core $C'$ could result in additional cache misses over the sequential
computation. We will compute this cost separately in Section 4.2.1. For the remainder of the
current section, we will bound the cache miss excess due to steals as reflected in Lemma 4.3. We
start by bounding the number of cache misses incurred by all stolen tasks in a BP computation.

Lemma 4.4. Let $\Pi$ be an $f$-cache friendly BP computation of size $n$, and consider its execution
on $p$ cores, each with a cache of size $M$. In addition, let $Q$ be the number of cache misses in a
sequential execution of $\Pi$. Finally, suppose that $f(r) = O(r/B)$ for $r \ge (2c_1/c_2) \cdot M$. Then, when
executed using PWS,

(i) The number of cache misses is bounded by

$$O\Big(Q + p\big(\tfrac{M}{B} + \log B\big) + \sum_{j \ge 0} p \cdot f(2M\alpha^j)\Big).$$

(ii) If $M \ge B \log B$ and $f(r) = O(1)$, the number of cache misses is bounded by $O(Q + p \cdot M/B)$.
(iii) If $M \ge B^2$ and $f(r) = O(\sqrt{r})$, the number of cache misses is bounded by $O(Q + p \cdot M/B)$.

Proof. (i) By Lemma 4.3, it suffices to bound the cache misses incurred by stolen tasks of size less
than $2M$. Let $d$ be the largest priority of any such task. Then tasks of priority $d - j$ have size at
most $2M\alpha^j$. Each such task incurs $O(\lceil M\alpha^j/B \rceil + f(2M\alpha^j))$ cache misses. Summing over all tasks
yields $O(p(M/B + \log B) + p\sum_{j \ge 0} f(2M\alpha^j))$ cache misses.

(ii) The summation is $O(pM/B)$ in this case, and the bound follows.

(iii) Since $f(r) = O(\sqrt{r})$, for $r \ge M$ we have $f(r) = O(r/\sqrt{M}) = O(r/B)$. Thus part (i)
applies. It remains to observe that $p\sum_{j \ge 0} f(2M\alpha^j) \le O(p\sqrt{M}) = O(p \cdot M/B)$, which yields the
claimed bound. ■
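The last step of part (iii) is a geometric sum. Writing $c_f$ for the constant hidden in $f(r) = O(\sqrt{r})$ (a symbol introduced here only for this worked step), it can be spelled out as:

```latex
\sum_{j \ge 0} p \cdot f(2M\alpha^j)
  \;\le\; c_f\, p \sum_{j \ge 0} \sqrt{2M\alpha^j}
  \;=\; c_f\, p \sqrt{2M} \sum_{j \ge 0} \alpha^{j/2}
  \;=\; \frac{c_f\, p \sqrt{2M}}{1 - \sqrt{\alpha}}
  \;=\; O(p\sqrt{M}),
\qquad
\sqrt{M} \;=\; \frac{M}{\sqrt{M}} \;\le\; \frac{M}{B}
  \quad \text{when } M \ge B^2 .
```

The sum converges because $\alpha < 1$ is a constant, and the tall cache assumption is used only in the final comparison of $\sqrt{M}$ with $M/B$.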

Let $Q(\pi)$ be the sequential cache complexity of a computation $\pi$, and let $Q_{PWS}(\pi)$ be the
number of cache misses incurred by $\pi$ when it is scheduled under PWS. Then define the PWS
cache miss excess, $Q_C(\pi)$, as follows: $Q_C(\pi) = Q_{PWS}(\pi)$ if $Q_{PWS}(\pi) = \omega(Q(\pi))$, and $Q_C(\pi) = 0$
if $Q_{PWS}(\pi) = O(Q(\pi))$. Thus for instance, in parts (ii) and (iii) of the above lemma, the PWS
cache miss excess for any BP computation with $f(r) = O(\sqrt{r})$ and $M \ge B^2$ is $Q_C(\pi) = O(Mp/B)$.
In particular, when $n \ge Mp$, i.e., when the input does not fit into the caches, the PWS cache miss
excess is zero.

The following corollary addresses smaller input sizes ($n < Mp$) for a case that arises in our list
ranking and graph algorithms.

Corollary 4.2. Let $\pi$ be an $f$-friendly BP computation of size $n < Mp$ and suppose that $\alpha = 1/2$
and $f(r) = O(1)$. Then, the PWS cache miss excess $Q_C(\pi)$ is:

(i) For $Bp \le n < Mp$, $Q_C(\pi) = O\big(p\log B + \frac{n}{B}\log\frac{2Mp}{n}\big)$.

(ii) For $p \le n < Bp$, $Q_C(\pi) = O\big(\frac{n}{B}\log\frac{2\min\{n, M\}}{B} + p\log\frac{2n}{p}\big)$.

(iii) For $n < p$, $Q_C(\pi) = O\big(\frac{n}{B}\log\frac{2\min\{n, M\}}{B} + n\big)$.

Proof. We proceed as in the proof of Lemma 4.4. As before, we need to sum the cache misses
due to stolen tasks of size smaller than $2M$. There are $\min\{p, \frac{n}{2M\alpha^j}\}$ tasks of size $2M\alpha^j$, each
generating $O(\lceil \frac{2M\alpha^j}{B} \rceil)$ cache misses. Summed over all $j$, in Case (i) this gives
$O\big(p\log B + \frac{n}{B}\log\frac{2Mp}{n}\big)$ cache misses, in Case (ii),
$O\big(\frac{n}{B}\log\frac{2\min\{n, M\}}{B} + p\log\frac{2n}{p}\big)$ cache misses, and in Case (iii),
$O\big(\frac{n}{B}\log\frac{2\min\{n, M\}}{B} + n\big)$ cache misses. We argue only Case (i). Summing over the increasing values
of $j$, starting at $j = 0$, there are terms $O(n/B)$ for $\lceil \log(p/(n/2M)) \rceil$ values of $j$, followed by terms
$p \cdot O(\frac{2M\alpha^j}{B})$ up to the first value of $j$ for which $M\alpha^j \le B$, which total $O(n/B)$, followed by terms
$p \cdot O(1)$ for $\log B$ values of $j$. The bound in (i) now follows readily.



A BP collection is a collection of parallel independent BP computations; similarly, an HBP 
collection is a collection of parallel independent HBP computations. Such collections are generated 
when parallel recursive calls are made in an HBP computation. The size of a BP or HBP collection 
is the maximum size among the independent computations in the collection and its total size is the 
sum of the sizes of the independent computations in the collection. It follows from the definition of 
a balanced HBP computation that in any BP or HBP collection that occurs in it, every independent 
computation in the collection will have the same size, to within a constant factor (of $c_2/c_1$).

As our cache miss bounds for a single BP computation depend only on the binary forking and 
priorities based on size, the previous bounds apply unchanged to a BP collection, yielding the 
following corollary. 

Corollary 4.3. Let $\Pi'$ be an $f$-friendly BP collection of size $r$, whose total size is $n \ge Mp$. Then
the PWS cache miss excess for $\Pi'$ is bounded by

$$O\Big(p \cdot \Big(\frac{\min\{M, r\}}{B} + \log\min\{r, B\} + f(r)\Big)\Big).$$

Proof. The argument is similar to that of Lemma 4.4. The one change is that $f(r)$ cannot be
bounded by $r/B$. ■



Cache Misses in the Up-pass. In the up-pass that follows a down-pass, the computation
involving the activation of a suspended task $\tau$ follows the mirror image of the initial pass that forked
the children of that suspended task. Since we have assumed that there is only $O(1)$ computation at
each suspended task-head, the PWS cache miss cost of the up-pass is readily bounded. The main
additional cost arises because a stolen task may assume the remaining work of its parent due to
usurpation, in which case it needs to read data from its parent's local stack. The cost is bounded
by the total size of these stacks divided by $B$, which is $O$(number of nodes in the BP tree$/B$), plus
an additional $O(1)$ cache misses per steal. This is subsumed by the cost of the downpass. Thus
there is zero contribution from the up-pass of a BP computation $\pi$ to the PWS cache miss excess
for $\pi$. Note, however, that these reads may interleave with writes to these same execution stacks
by other cores, and we will analyze their costs as part of the block miss analysis.

4.2.1 HBP computations 

We extend priorities to a balanced HBP computation $\mathcal{C}$ in a natural way. Priorities decrease with
depth, and in each BP collection in $\mathcal{C}$, all nodes at depth $i$ in the collection have the same priority,
$i \ge 0$. Thus, all nodes with the same priority will have the same size, to within a factor of $c_2^2/c_1^2$.

We now bound the cache miss excess in the HBP computations we consider. First, by the same
argument as for Lemma 4.3 we obtain:

Lemma 4.5. Consider an $f$-cache friendly balanced HBP computation $\Pi$ scheduled under PWS on
$p$ cores, each with a cache of size $M$. If $f(r) = O(r/B)$ for $r \ge (2c_1/c_2) \cdot M$, there is zero PWS
cache miss excess due to stolen tasks of size $2M$ or more.

Usurpations. Consider two successive HBP collections $H_1 = \{h_{11}, h_{12}, \ldots, h_{1k}\}$ and
$H_2 = \{h_{21}, h_{22}, \ldots, h_{2k}\}$, where, for each $i$, $h_{2i}$ follows $h_{1i}$ in the sequential computation. Let $h_{1i}$, for
some $i$, start its computation on some core $C$. Because of usurpation it may be that a core $C'$,
that stole a subtask from $C$ within $h_{1i}$, will be the one starting the computation $h_{2i}$; we call $C'$
a usurper. Further, even if $C$ starts the computation of $h_{2i}$, it may be that, due to steals, $C$ has
not read in some of the data for $h_{1i}$ that potentially gets reused in $h_{2i}$. In this case we say that
$C$ is semi-usurped. In both cases (usurped and semi-usurped) we need to bound the cost of the
additional reads needed to execute $h_{2i}$ since our analysis in Lemma 4.4 assumed that the core
starting the BP computation had the same state as in the sequential computation. If $k$ is large,
then conceivably there could be a large number of usurpers at $H_2$ resulting in a large cache miss
cost. We now argue that this is not the case for balanced HBP collections.

Lemma 4.6. Let $H_1 = \{h_{11}, h_{12}, \ldots, h_{1k}\}$ and $H_2 = \{h_{21}, h_{22}, \ldots, h_{2k}\}$ be successive balanced HBP
computations. Then, there are at most $p - 1$ usurpers and semi-usurpers.

Proof. Let the first sub-task stolen in $H_1$ be $h'$, and let it be a sub-task of $h_{1j}$, for some $j$. At the
time $h'$ is stolen, all of the $h_{1i}$ have started their execution, since otherwise there would be a task
of higher priority than $h'$ available for stealing. But then, at most $p - 1$ of the $h_{1i}$ can still be in
the process of being executed. These are the only tasks that can be usurped or semi-usurped. ■

As noted earlier, all of our HBP algorithms are balanced, hence the above lemma applies to 
them. The sorting algorithm SPMS in [12] is not a balanced HBP computation; however, a different
argument bounds the cost of usurpations in that algorithm. 

Cache Misses in Balanced HBP Computations. By Lemma 4.5, stolen tasks of size $2M$ or
larger in an HBP computation incur no cache miss excess under PWS. In the following lemma,
we bound the excess due to steals of smaller tasks in an HBP computation using Lemma 4.4 and
Corollary 4.3, and we bound the cost of usurpations using Lemma 4.6.

Lemma 4.7. Let $\pi$ be an $f$-friendly balanced HBP computation with $f(r) = O(\sqrt{r})$. If $M \ge B^2$,
the PWS cache miss excess for $\pi$ is

$$O\Big(p \cdot v'\big(M/B + \log B\big) + p \cdot \sum_i \big(n_i/B + \log B + f(n_i)\big) + p \cdot \sum_j \big(\log n_j + f(n_j)\big)\Big),$$

where $v'$ is the number of BP collections of size at least $2M$, the first sum is over BP collections
whose size is $n_i$, $B \le n_i < 2M$, and the second sum is over BP collections whose size is $n_j < B$.

Proof. The cost of stealing small tasks is bounded by applying Lemma 4.4 to the initial
computation, and then applying Corollary 4.3 to the BP collections in the computation.

To bound the cost of usurpers we apply Lemma 4.6. Since there are at most $p - 1$ usurpers and
semi-usurpers for each BP collection, and we have already charged for $p - 1$ steals at the start level
for this collection, the usurpations increase this cost by at most a factor of 2 for each collection. ■

Using Lemma 4.7 we obtain the following Lemma 4.1, which was stated at the beginning of
Section 4.

Lemma 4.1. Let $\Pi$ be a balanced Type 2 HBP computation of size $n \ge Mp$, and let $c$, $s(n)$, and
$f(r)$ be as defined earlier. Then, the cache miss excess for $\Pi$ when scheduled under PWS has the
following bounds with a tall cache $M \ge B^2$.

(i) If $c = 1$, $f(r) = O(\sqrt{r})$: $O(p \frac{M}{B} s^*(n, M))$.

(ii) If $c = 2$, $f(r) = O(\sqrt{r})$, and $s(n) = \sqrt{n}$: $O(p \frac{M}{B} \frac{\log n}{\log M})$.

(iii) If $c = 2$, $f(r) = O(\sqrt{r})$, and $s(n) = n/4$: $O(p[\frac{\sqrt{nM}}{B} + \sqrt{\frac{n}{M}} \sum_{i \ge 0} 2^i f(M/4^i)])$.

4.3 Block Misses Under PWS

Let $\pi$ be a BP collection of size $r$ and total size $n$ in which each task $\tau$ shares at most $L(|\tau|)$ blocks
with other tasks. There are at most $n/r$ BP trees of size $r$ in the collection, hence each level
below level $\log(pr/n)$ will have at least $p$ tasks. Under PWS, there are at most $p$ steals at any
level, hence the total number of shared blocks across all cores due to steals under PWS in this
computation is $X(r)$, where

$$X(r) \le \sum_{\log(n/r) \le j \le \log\min\{p,\, r^{1/\log(1/\alpha)}\}} 2^j \cdot L(r\alpha^j) \;+\; p \sum_{\log p < i \le \log r^{1/\log(1/\alpha)}} L(r\alpha^i) \qquad (1)$$

Hence, using Lemma 3.1, we obtain that the sum of the block waits across all cores during this
computation is bounded by $Z(r) = Y(B) \cdot X(r)$, for $r \ge B$, and by $Z(r) = Y(r) \cdot X(r)$, for $r < B$.

For a computation $\pi$ scheduled under PWS, let $Q_{PWS,B}(\pi)$ be the block wait cost of $\pi$, and
as before, let $Q(\pi)$ be the sequential cache complexity of $\pi$. As with the PWS cache miss excess,
we define the PWS block miss excess, $Q_B(\pi)$, as $\pi$'s block wait cost $Q_{PWS,B}(\pi)$ if
$Q_{PWS,B}(\pi) = \omega(Q(\pi))$, and $Q_B(\pi) = 0$ otherwise.

Lemma 4.8. Let $\pi$ be the downpass of a BP collection of size $r$. Suppose that $Y(|\tau|) = O(\min\{B, |\tau|\})$.

(i) If the block-sharing function $L(|\tau|) = O(1)$, then, for $r \ge B$, the PWS block miss excess for
$\pi$ is $O(p \cdot B\log B)$; for $r < B$, the block miss excess is $O(p \cdot r\log r)$.

(ii) If the block-sharing function $L(|\tau|) = \sqrt{|\tau|}$, and $\alpha = 1/2$, then, for $r \ge B$, the PWS block
miss excess for $\pi$ is $O(B \cdot \sqrt{pr})$; for $r < B$, it is $O(r \cdot \sqrt{pr})$.

Proof. (i) For $r \ge B^2$, we first consider the stolen tasks of size $B^2$ or larger. Similar to the proof
of Lemma 4.3, we can see that the kernel of task $\tau$ (either a stolen subtask or an original task in
this BP collection) after all its stolen subtasks of size $B^2$ or larger are removed still has size $\Omega(B^2)$.
Consequently, its execution (which may include further steals of smaller subtasks) will incur a cache
miss cost of $\Omega(B)$. As the block wait cost for a stolen task is $O(B)$, we can conclude that the block
miss excess for all original and stolen tasks of size $B^2$ or larger is zero.
If $r \ge B^2$, then the remaining cost is bounded by

$$B \cdot p \sum_{i:\, r\alpha^i < B^2} L(r\alpha^i) = O(Bp\log B). \qquad (2)$$

For $r < B^2$, the cost is bounded by $O(\min\{r, B\} \cdot p\sum_{i \ge 0} L(r\alpha^i)) = O(p\min\{r, B\}\log r)$.

(ii) This follows directly from Equation (1). ■



Block Misses in the Up-pass. Recall that there are no steals during the up-pass. Instead, as
mentioned in Section 2, at any non-leaf node $u$ in a BP tree, of the tasks at its two children, the
later finishing one will continue the up-pass from $u$ toward the root of its BP tree. The following
lemma assumes the data layout given in Section 3.3.

Lemma 4.9. Let $\mu$ be the up-pass of a BP collection of size $r$ and total size $n$, scheduled under PWS.
Then, for $r^{1/\log(1/\alpha)} > p$, the block miss excess for $\mu$ is
$O(\min\{r, B\}\, p\, (\log_{1/\alpha} r - \log(\max\{2, pr/n\})))$;
for $r^{1/\log(1/\alpha)} \le p$ it is $O(\min\{r, B\}\, p)$. For $\alpha = 1/2$ and $r \ge B$, the block miss excess is
$O(Bp\log B)$, if $n \ge pM$ and $M \ge B^2$; for $\alpha = 1/2$ and $r < B$ it is $O(rp\log r)$.

Proof. We first note that the computation at each node in the up-pass is of constant size, hence
by the limited-access property there is an $O(\min\{r, B\})$ block wait cost at each node regardless of
the nature of the block-sharing function $L(|\tau|)$.

Recall that the two children of an internal node in a BP tree have size between $c_1\alpha \cdot r$ and
$c_2\alpha \cdot r$, for some $\alpha$ with $1/2 \le \alpha < 1$ and constants $0 < c_1 \le 1 \le c_2$. Hence the height of the tree
is $\log_{1/\alpha} r + O(1) = \log_2 r^y + O(1)$, where $y = 1/\log_2(1/\alpha)$. Then, the number of nodes in the top
level is at least $n/r$ and this number increases to $p$ by level $l = \log q$, where $q = \max\{2, pr/n\}$. (We
assume that $M \ge c_2/c_1$ so that leaves are not encountered before level $q$.)

We first consider the total block wait cost at the top $l$ levels of the BP tree collection. There are
at most $2p - 1$ nodes in this portion of the BP tree collection, and the tasks at the two children of a
node $u$ are the only ones that access $u$ for their computation. By the limited access property, each
of these $4p - 1$ tasks incurs a block wait cost no more than $O(\min\{r, B\})$. Hence the contribution
of the top $\log p$ levels of the up-pass to the total block wait cost is $O(\min\{r, B\}p)$. For $q > r^y$, this
cost reduces to $O(\min\{r, B\}r^y)$ over all the $\log_{1/\alpha} r + O(1)$ levels of the up-pass.

Only in the case that $q \le r^y$ are there further levels to consider. We now consider these
bottom $\log(r^y/q) + O(1)$ levels. Under PWS, there are at most $p\log(r^y/q)$ stolen tasks in these
levels. Further, consider a block $\beta$ on an execution stack that is accessed by $x$ stolen tasks.
Each of these stolen tasks could incur a block wait cost of up to $x$ due to its accesses to $\beta$.
But only one of these tasks would continue computing further up its computation tree. This
is because the block $\beta$ contains the execution stack data for a sequence of ancestors in the BP
tree, and only the task that completes the computation at the highest ancestor will continue
the computation up the tree. Hence, if there are $s$ different blocks accessed on the execution
stacks across all tasks during this computation, and if $r_i$, $1 \le i \le s$, is the number of tasks
accessing the $i$th block, then $1 + \sum_{i=1}^{s}(r_i - 1)$ is bounded by $c_2/c_1$ times the total number of steals
in the $\log(r^y/q)$ levels, which is at most $cp\log(r^y/q)$. Since $s = O(p\log(r^y/q))$ it follows that
$\sum_{i=1}^{s} r_i = O(p\log(r^y/q))$, hence by the limited access property, the total block wait cost incurred
in these $\log(r^y/q)$ levels is $O(Bp\log(r^y/q))$. The block wait cost for accessing other data items also
satisfies this bound since each task will access $O(1)$ data with $O(\min\{r, B\})$ block wait cost, which
is accounted for in the above bound. Since $y = 1/\log_2(1/\alpha)$, this establishes that the block miss
excess is $O(\min\{r, B\}\, p\, (\log_{1/\alpha} r - \log(\max\{2, pr/n\})))$.

For $r \ge B$, when $\alpha = 1/2$, the block miss excess is $O(Bp \log(n/p))$.

We now show that $Bp \log(n/p) = O(n/B + Bp \log B)$ if $n \ge pB^2$ (which holds if $n \ge pM$ and
$M \ge B^2$).

$$Bp \log(n/p) \le Bp \left( \log\frac{n}{pB^2} + \log B^2 \right)
\le \frac{n/B}{n/(pB^2)} \cdot \log\frac{n}{pB^2} + Bp \log B^2 \le \frac{n}{B} + 2Bp \log B$$

The last step uses $\log x \le x$ for $x = n/(pB^2) \ge 1$.

Hence the block miss excess for the up-pass is $O(Bp \log B)$ in this case.
For $r < B$, the claimed bound is immediate.
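
The inequality $Bp \log(n/p) \le n/B + 2Bp \log B$ can be spot-checked numerically. The sketch below assumes base-2 logarithms and the precondition $n \ge pB^2$ used in the derivation; the function name `up_pass_bound_holds` is hypothetical.

```python
import math

def up_pass_bound_holds(n, p, B):
    """Check B*p*log2(n/p) <= n/B + 2*B*p*log2(B) for given parameters.

    The derivation assumes n >= p * B^2, so that x = n / (p * B^2) >= 1
    and log2(x) <= x can be applied.
    """
    if n < p * B * B:
        raise ValueError("the derivation assumes n >= p * B^2")
    lhs = B * p * math.log2(n / p)
    rhs = n / B + 2 * B * p * math.log2(B)
    return lhs <= rhs
```

Calling it on a few parameter choices, such as `up_pass_bound_holds(2**30, 64, 64)`, illustrates the bound; it is a sanity check on sample values and not a proof.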



We can now prove Lemma 4.2 by applying Lemmas 4.8 and 4.9.

Lemma 4.2. Let H be a balanced Type 2 HBP computation of size $n \ge Mp$ with $\alpha = 1/2$, which is
exactly linear space bounded, and let c, s(n), and L(r) be as defined earlier. Then, the block miss
excess for H when scheduled under PWS has the following bounds if L(r) = O(1).

(i) c = 1: a cost of $O(pB \log B \cdot s^*(n))$ cache misses.

(ii) c = 2 and $s(n) = \sqrt{n}$: a cost of $O(pB \log n \log\log B)$ cache misses.

(iii) c = 2 and s(n) = n/4: a cost of $O(pB\sqrt{n})$ cache misses.

Proof. The bounds are obtained by summing over the cost of the successive collections of BP
computations.

(i) follows from Lemmas 4.8 and 4.9, as there are $s^*(n)$ successive collections of BP computations,
each collection costing $O(Bp \log B)$ cache misses.

(ii) The cost of the BP computations is bounded by
$O\left(Bp \sum_{i=0}^{\log\log n - \log\log B} 2^i \log B + p \sum_{j \ge 1} B \cdot 2^{j + \log\log n - \log\log B} \log B^{1/2^j}\right) = O(Bp \log n \log\log B)$.

(iii) The cost of the BP computations is bounded by
$O\left(Bp \sum_{i=0}^{\frac{1}{2}\log(n/B)} 2^i \log B + \sum_{j \ge 1} Bp \cdot 2^{j + \frac{1}{2}\log(n/B)} \log(B/2^{2j})\right) = O(p\sqrt{nB} \log B + pB\sqrt{n}) = O(pB\sqrt{n})$. ■
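
The summation for case (iii) can be illustrated numerically. The sketch below rests on an assumed cost model, read off from Lemma 4.9 with $\alpha = 1/2$: recursion depth i contributes $2^i$ successive collections of size $r = n/4^i$, and one collection costs $Bp \log B$ when $r \ge B$ and $rp \log r$ otherwise. Constant factors are set to 1 and logarithms are base 2; the function name is hypothetical.

```python
import math

def block_miss_excess_case_iii(n, p, B):
    """Sum illustrative per-collection up-pass costs for c = 2, s(n) = n/4.

    Assumed model (not taken verbatim from the paper): depth i has 2**i
    successive collections of size r = n / 4**i, each costing
    min(r, B) * p * log2(min(r, B)).
    """
    total = 0.0
    depth = 0
    r = float(n)
    while r >= 1:
        m = min(r, float(B))
        total += (2 ** depth) * m * p * math.log2(m)
        depth += 1
        r /= 4
    return total
```

Under this model the total stays within a small constant factor of $pB\sqrt{n}$ on sample inputs, which is consistent with the bound stated in part (iii).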



4.4 Idle Time and Scheduling Costs 

We now bound the total work during a computation $\pi$ that could be attributed to idle cores (i.e.,
when a core is neither computing, nor waiting on a cache miss, block miss, or steal initiated by
that core).

Idle time can occur when some cores are executing the up-pass of a BP computation. Consider
the up-pass of a BP computation of size n. We bounded the block miss excess in Lemma 4.9.
Here we consider the idle time incurred by such an up-pass. For this, we need to account for the
time that some of the cores may need to wait in order for the up-pass to complete so that new
tasks can be generated on the task queues and become available for stealing. This wait time starts
at the time T when the last task in the downpass starts its computation, and it ends when the
computation completes at the root of the up-pass tree.

Lemma 4.10. Consider the up-pass of a limited-access BP collection of total size n. The total idle
time incurred during the up-pass is bounded by $O(bp \cdot (\log n + B \log B))$, where b is an upper bound
on the delay due to a cache miss.

Proof. Let t be an upper bound on the computation time at a leaf in the downpass of the BP
computation, followed by the computation from the leaf for $\log(c_2 B/c_1)$ levels up along the path
to the root of the up-tree. We note that $t = O(bB \log B)$ by the limited access property, and the
fact that base case tasks and up-pass nodes perform O(1) computation. Also, once the up-tree
computation is above the lowest $l = \log(c_2 B/c_1)$ levels of the up-tree, a write to each output data
item has zero block wait cost, since by the in-order organization of the output data, the writes at any
two nodes are separated by at least a distance of B in the output. Thus, at levels above l in the
up-tree we only need to consider block sharing on the execution stacks of the tasks.

Let the number of blocks accessed at each node in the up-tree be c = O(1). Let the last task in
the downpass start its computation at time T. We establish the following assertion by induction
on the height of a node u in the BP tree.

Idle Time Assertion. Let u be a node at height $h_u \ge \log(c_2 B/c_1)$ in the up-tree. The up-pass
computation at u is completed by time $T_u = T + t + b(ch - k_u)$, where h is the height of the up-tree,
and $k_u$ is the number of writes not yet performed at proper ancestors of u in the up-tree.

The proof is by induction on the height of u in the up-pass tree, and takes into account the fact
that the entries on any given execution stack are a sequence of data on successive ancestors in the
tree.

Base Case. The node u is at height $\log(c_2 B/c_1)$. Then, it is done by time T + t. Since ch is the
total number of writes that need to be performed at all nodes on a path from a leaf to the root,
we have $ch \ge k_w$ for every node w in the up-tree, hence $T + t \le T_u$, and hence u's computation is
done by time $T_u$.

Induction Step. Assume inductively that the result holds for all nodes at height up to $h - 1 \ge
\log(c_2 B/c_1)$. Let u be a node at height h − 1, let v be u's sibling, and let w be u's parent, which is
at height h.

By the inductive assumption, node u's computation is completed by time $T_u$. Further, since
u and v have the same proper ancestors, v's computation is also completed by time $T_u$, as is the
computation at all descendants of u and v.

Now consider the computation at w. It is completed by time $t_w = T_u + b \cdot (x + y')$, where x is
the number of accesses performed by the task at w, and y' is the number of accesses performed by
tasks that write into blocks shared by w. By the inductive assumption all nodes at height h − 1
and lower have completed by time $T_u$, hence all of these writes are to locations at proper ancestors
of w. Let y be the number of writes performed at all proper ancestors of w during this time. Then,
$y' \le y$, and $k_w = k_u - (x + y)$, hence,

$t_w = T + t + b(ch - k_u) + b(x + y') \le T + t + b(ch - (k_u - (x + y))) = T + t + b(ch - k_w) = T_w$.

This establishes the induction step.

With this claim, applied at the root of the up-tree (where $k_u = 0$ and $h = O(\log n)$), we have the
result that the delay for the up-pass beyond time T is $O(b(B \log B + \log n))$, and this establishes
the lemma. ■

Let $s_p$ be the delay at a core due to the cost of a steal under PWS. When a computation with
critical pathlength D is scheduled under PWS, it incurs a total scheduling cost of $O(p \cdot s_p \cdot D)$ by
Corollary 4.1. We can expect $s_p \ge b$, since at least one cache miss is incurred when a core attempts
to steal a task from another core.

Lemma 4.11. Let $\pi$ be a BP computation on an input of size n. If $s_p = \Omega(b)$, then the total time
spent on idle work by all p cores in executing $\pi$ is bounded by the sum of the bounds of $O(s_p \cdot p \log n)$
on the scheduling cost and $O(bBp \log B)$ on the block delay costs.

Proof. As shown above, the idle time for the up-pass is $O(pb(B \log B + \log n))$. The first term is
dominated by the cost $O(pB \log B)$ we established for the PWS block miss excess and the second
term is dominated by the PWS scheduling cost if $s_p \ge b$. The idle time that occurs in the top few
levels of the downpass, until p tasks are generated, is part of the steal cost $s_p$, since the priority
increases with each unsuccessful steal. Hence the idle work is dominated by the sum of block delay
and scheduling costs. ■

4.5 Analysis of HBP Algorithms under PWS

We apply our results for PWS to the HBP algorithms discussed in Section 3.2. Here, we assume
that each core performs a single operation in O(1) time, and a cache miss takes at most b time.
We assume that the input is of size $n \ge Mp$ (the input size is $n^2$ for matrix computations), and it
is in the shared memory at the start of computation.

Cache Misses Only. If we ignore block misses, then for all Type 1 and Type 2 HBP algorithms
we consider, PWS achieves the following run time, where $s_c$ is the cost of a steal when only cache
misses are considered. We show that $s_c = b \log p$ for our PWS implementation (Section 4.7).

$$O\left(\frac{1}{p}\left(W(n) + b \cdot Q(n, M, B)\right) + s_c \cdot T_\infty(n)\right)$$

This is a new result, which is not known to be achievable using RWS.

Cache and Block Misses. If we consider both cache and block misses, and $s_p$ is the cost of a
steal when both cache and block miss costs are considered, we achieve the following bound with
PWS for the Type 1 and Type 2 HBP algorithms we consider, where $s_p = O(b \log p)$ for our PWS
implementation if we use a padded version of BP and HBP computations (see Section 4.7), and is
$O(b(B + \log p))$ with standard HBP computations.

$$O\left(\frac{1}{p}\left(W(n) + b \cdot Q(n, M, B)\right) + s_p \cdot T_\infty(n)\right) \quad \text{if } M \ge T(B)$$

Here the tall cache requirement T(B) depends on the problem; the requirement for each algorithm
is stated in Lemma 4.12.

We now present our results for our Type 1 and Type 2 HBP algorithms, when scheduled under
PWS. For convenience of notation, we will ignore constant factors in our analyses, and we will use
≥ and ≤ in place of Ω and O.

Lemma 4.12. When scheduled using PWS, the following algorithms have the stated running times
when the input size is $\Omega(Mp)$, considering both cache and block misses.

(i) Scans: $O((1/p) \cdot (n + b \cdot (n/B)) + s_p \cdot \log n)$ with $T(B) = B^2 \log B$.

(ii) MT (BI) and RM to BI: The bounds from (i) apply, with $n^2$ replacing n.

(iii) Strassen (BI): $O((1/p) \cdot (n^{\lambda} + b \cdot (n^{\lambda}/(B \cdot M^{\lambda/2 - 1}))) + s_p \cdot \log^2 n)$ with $T(B) = B^2 \log^2 B$.

(iv) Depth-n-MM (BI): 0((l/p) • (n^ + 6 • {n^/{BVM)) + sp-n) with T{B) = B^. 

(v) BI-RM (gap RM): $O((1/p) \cdot (n^2 + b \cdot (n^2/B)) + s_p \cdot \log n)$ with $T(B) = B^2 \log^2 B$.

(vi) BI-RM for FFT: $O((1/p) \cdot (n^2 \log\log n + (b/B) \cdot (n^2 \log_M n)) + s_p \cdot \log n)$ with
$T(B) = B^2 \log B \log\log B$.

(vii) FFT: $O((1/p) \cdot (n \log n + (b/B) \cdot n \log_M n) + s_p \cdot \log n \log\log n)$ with $T(B) = B^2 \log B \log\log B$.

All of these bounds hold with the standard tall cache $M \ge B^2$ if only cache misses are considered.

Proof. (i, ii) Scans, MT (BI), RM to BI. As noted in Section 3.2, we have a single BP computation
with $\alpha = 1/2$, f(r) = O(1), and L(r) = O(1). Hence by Lemma 4.4 the PWS cache miss
excess is $O(Mp/B + p \log B)$, and by Lemma 4.8 the PWS block miss excess is $O(Bp \log B)$.
RM to BI has $f(r) = \sqrt{r}$; however, Lemma 4.4 continues to hold for all $f(r) = O(\sqrt{r})$, hence this
computation has the same bounds as MT. Hence the cache and block miss excess is dominated by
the sequential cache complexity when $n \ge Mp$ if $M \ge T(B)$, for $T(B) = B^2 \log B$.

(iii) Strassen (bit-interleaved layout). From Section 3.2, this is an HBP computation with
c = 1, s(m) = m/4, where $m = n^2$ is the size of the matrix, f(r) = O(1) and L(r) = O(1). As
noted there, the sequential work is $O(n^{\lambda})$, the critical pathlength is $O(\log^2 n)$, and the sequential
cache complexity is $O(n^{\lambda}/(B M^{\lambda/2 - 1}))$. By Lemma 4.1, substituting $s(n^2) = n^2/4$ and c = 1, the
PWS cache miss excess is 0(p[^ log ^ + log2 B]), and by Lemma SSI the PWS block miss excess 
\s Qb = O {pB log B log n^). 

The most constraining constraint is the block miss excess Qb- We now derive a bound for T{B) 
based on Qb- We need Qb = pB log B log < which is satisfied if pB'^ log B ■ M^l'^^^ < 

thus pB^ log B ■ M^/^-i iog(p5M) < n^ suffices. 

Since we have assumed that n^ > Mp, we have > {Mp)^/'^ . 

Thus it suffices to have log 5(pM)V2__L_ log(pMS) < (Mp)^/^, and this is met if 

B^logBlogp+logM+logB < j^g^^ > j^g^ ^A/2-1 > j^g ^ ^ > 2.5), P+log M+log B < 

logM. Thus ^ log^iogA/ ^ suffices, and for this the following stronger tall cache condition 
T{B) = B^ log2 B <M suffices. 

(iv) Depth-n-MM (BI layout). Here we use the cache-oblivious $n^3$-work MM algorithm in [17],
as modified in [13] to enforce limited access variables. This is an HBP computation with c = 2,
and $s(n^2) = n^2/4$, with f(r) = O(1) and L(r) = O(1), hence by part (iii) of Lemma 4.1 the cache
miss excess is 0{p{l + "'^^ )). The block wait cost is 0{pnbB) by Lemma 14.21 The cache miss 

3 

excess and block delay cost are bounded by the sequential cache complexity of foi" input 

size > Mp with a taller cache having T{B) > B'^. 
(v) BI-RM (gap RM). This is an O(log n) parallel running time, $O(n^2)$ operation algorithm.

This is a sequence of two BP computations where the second computation has the bounds of
a standard scan, so the cost is dominated by the first BP computation. As described in Section
3.2, this first BP computation has f(r) = O(1), and the cache miss excess remains O(Mp/B) for
$M \ge B^2 \log B$. For the block miss excess we have $L(r) = \sqrt{r}$. However, its effect is reduced using
a gapping technique, the result of which is that there is no block miss cost for tasks of size greater
than $\sigma = B^2 \log^2 B$. The total block miss cost for stolen tasks of size $\sigma$ or less is dominated by the
stolen tasks of size $\sigma$ due to the geometrically decreasing sizes in the BP computation. The
sequential cache complexity, $n^2/B$, dominates the resulting block miss excess for input size
$n^2 \ge pM$ with a taller cache $T(B) \ge B^2 \log^2 B$.

(vi) BI-RM for FFT. O(log n) parallel running time, $O(n^2 \log\log n)$ operation algorithm.

Recall from Section 3.2 that this is a Type 2 HBP computation with c = 1, with $s(n^2) = n$.
From Section 3.2 we have $f(r) = O(\sqrt{r})$ and L(r) = O(1). This gives a cache miss excess of
0(p-^ loglogjv/ By lemma W?2\ the excess block wait cost is O (pi? log i? log log i?). This is 
dominated by the sequential cache complexity of FFT (on an input of size $n^2$) if $(n^2/B) \log_M n^2 \ge
pB \log B \log\log B$. Since $n^2 \ge Mp$, a tall cache $M \ge B^2 \log B \log\log B$ suffices.

We will see below that as the costs of this method are dominated by those for FFT in all 
dimensions (work, parallel time, and cache and block misses) this algorithm can be used for FFT. 

(vii) FFT. This is a Type 2 HBP computation with c = 2 and $s(n) = \sqrt{n}$. We have O(1)-friendly
tasks with L(r) = O(1) (outside of the cost to finally convert the BI matrix into the RM output
format). By Lemma 4.2, the excess block wait cost is $O(pB \log n \log\log B)$. At the end, to convert
to the RM format we use BI-RM for FFT. ■



4.6 List Ranking 

We now analyze the cache miss, block miss and idle time cost overheads for the list ranking algorithm
LR presented in Section 3.2 when scheduled by PWS. For this, we need the following extension to
the results for SPMS in 



Lemma 4.13. The excess cache miss cost for sorting x items, $p \le x \le pM$, using SPMS, when
scheduled under PWS is bounded by

/ X pM log X — log X 
C o log ^ . + \/^T 



B X logx/p logx/p^ 
Proof. The excess cache miss cost of a BP collection of size r and total size x is given by 

0(^-|log^ + ^/yp+plogy/p^ r>y/p 

0{p\^ + plogr) r<y/p 

For the sorting problem, there are some c=0(l) collections of size x, 2c of size ^/x, and in general, 
2*c of size for 1 < i < log log x. All these collections have total size 0(x). Summing over all 

values of i yields an excess cache miss cost of 

/x pM logx — logx 

O —log 1 — + ^/xp- — +p logx log logx/p 

\B X logx/p logx/p 
The result follows, because the second term dominates the third one.



Corollary 4.4. The cache miss cost in the list ranking algorithm is 0{^ ^°^^.j + log ^ lo§ '^); 
for inputs of size n > Mplog^^^ n, assuming that M > log S log log . 

Proof. In each iteration of the list ranking algorithm there are 0(log^^'^ n) invocations of the sorting 
algorithm. As shown in [TT], the 0{Q) terms over all invocations of the sort procedure sum to 
0(-^ • ^jj)- Thus it suffices to add the costs given by Lemma [4. 131 for halving values of x, starting 

at X = n, and then multiply the resulting sum by log'-'^^ n. 

We consider this sum in more detail. For x < p, these terms contribute a total of 0{plogp) = 

^ / Mp log n B log M \ ^ / Mp log n \ 

'^V~B"IogM M ~ '^'^~B"logA/^ 

For p < X < Mp, The terms 0{^j;^) contribute 0(VpM^^) = Oip^M^) = 
0(2i^l!fM)'WithatancacheM>S2. The terms 0(f.log(Mp/x).^^) contribute 0(^(i2|^+ 

log 2 log(Mp/2) log 4 log Mp/i . log M log p \ f^/ Mp logn n _ 

2 logAf/2 + 4 logAf/4 ^ + M 1 / ~ '^V B logA/> " 

For the block wait cost, we note that with the gapping method, no more block misses occur
once the list has size at most $n/B^2$. Recalling that the sort algorithm has a block miss cost of
$O(Bp \log n \log\log(n/p))$ to sort a set of n items [12], we obtain the following block delay cost for
the list ranking algorithm.

Lemma AAA. The list ranking algorithm has a block wait cost of 0{Bp\ogn\og\og{n / p) log^*^-* n log B). 

Proof. This bound follows from observing that there are Oilog^^"^ n) sorts on sets of size at most 
n in each iteration of the list ranking algorithm. Further, only the first 0(logi?) iterations incur 
block misses. ■ 



Lemma 4.15. The cache and block wait costs of the list ranking algorithm are bounded by the 
sequential cache miss cost of 0{^^jj) if Mp < n / \og^^~'^'> n and B'^lo^ B\og\ogB < M. 

Proof. The claim regarding the cache miss cost is immediate from Corollary 14.41 To show the claim 
for the block miss cost, by LemmaSTH it suffices to show that 0{Bp log n log log(n/p) log^*^^ n log B) < 
0(f 1^). That is, 0(^2 log Mlog^'^^n log < 0{j^^^^^). Then it suffices to show that 

0(B2iogMlog('')nlogS(loglog5 + log(^)M + log(*=+2)n) < 0{n/p). U Mp < n/ log^''"^) n, then 
it suffices that 0(5^ log M log B(\og log B + log^^^ M)) < M, for which B"^ log^ B log log B = 0{M) 
suffices. ■ 

Finally, we obtain the following result for the list ranking algorithm when scheduled under 
PWS. 

Theorem 4.1. If M > B^ log^ i? log log i? and n > Mp • log'-'^^^^ n, then the list ranking algorithm 
runs in time 

O { - ( nlogn + b ■ sortin)) + sp • log^n log logn 
Proof. We only need to bound the idle time. For this, we note that the idle time for computation on
each contracted list of size greater than p is bounded by the corresponding bounds for sorting and
for scans. When the contracted list has size $x \le p$, the computation proceeds with full parallelism,
and the idle time is $O(bp \log x \log\log x)$, for a total cost of $O(bp \log^2 p \log\log p)$ for computation
on all contracted lists of size at most p. Since $p \le n$, this is bounded by the inherent cost of each
parallel step in the computation, and this establishes the bound in the theorem since it is obtained
by the bounds derived earlier for the work, critical pathlength, cache miss cost and block miss
cost. ■

The Euler tour and tree computation algorithms have the same complexity since they are simple 
applications of the parallel list ranking algorithm. 

Finally, the dominant cost in the connected components algorithm in pjj is log n stages of list 
ranking, and we obtain resource-oblivious implementation under PWS with both parallel time and 
cache complexity increased by a factor of log n. 

4.7 Distributed PWS implementation 

We present a simple distributed implementation of PWS. This implementation assumes that each
core context switches from its actual computation to perform O(1) computation on the distributed
implementation every k time units, for a suitable k. If k is a constant, then each core will devote
a constant fraction of its run time to tending to scheduling issues.

Our implementation supports steals by maintaining two full binary trees on p leaves, the steal
tree S and the task tree T, with the ith leaf in each tree for the ith core. The steal tree and task tree
support a prefix sums BP computation on a steal array S[1..p] and task array T[1..p] respectively.
There is a pointer from S[i] to a location that is set to 1 if core i needs to steal, and similarly
there is a pointer from T[i] to the task at the head of core i's task queue that is available to be
stolen.

The scheduling proceeds in phases, each taking O(logp) steps, and a steal request by a core will 
be processed in the phase that starts after the current phase completes. The ith core is responsible 
for the computation at leaf i, and at most one pre-assigned internal node in each tree. A scheduling 
phase starts in tree S, where each active core i (i.e, core that needs to steal) assigns a 1 to S[i]. 
The remaining positions are assigned 0. The p cores compute prefix sums in O(logp) steps, after 
which each active core knows its rank among the active cores. Then, the computation proceeds 
to the task tree. Those tasks in T[i] with priority matching the priority of the current round are 
assigned a 1, and the remaining leaves are assigned 0. A prefix sums computation on this tree 
assigns ranks to the tasks available to be stolen. The steals and tasks then match up by rank and 
the steals proceed to be executed, while the scheduling moves to the next phase. 
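
The matching performed in one scheduling phase can be sketched as follows. This is a sequential simulation under simplifying assumptions: the prefix sums are computed with a plain scan rather than in O(log p) parallel steps on the trees S and T, and the function and argument names (`schedule_phase`, `wants_steal`, `head_priority`, `round_priority`) are hypothetical.

```python
from itertools import accumulate

def schedule_phase(wants_steal, head_priority, round_priority):
    """Simulate one PWS scheduling phase and return (thief, victim) pairs.

    wants_steal[i]   -- True if core i needs to steal (the 1 entries of array S)
    head_priority[i] -- priority of the task at the head of core i's queue, or None
    round_priority   -- priority being served in the current round
    """
    s = [1 if w else 0 for w in wants_steal]
    t = [1 if pr == round_priority else 0 for pr in head_priority]
    s_rank = list(accumulate(s))  # prefix sums over the steal array
    t_rank = list(accumulate(t))  # prefix sums over the task array
    thief_by_rank = {s_rank[i]: i for i in range(len(s)) if s[i]}
    victim_by_rank = {t_rank[i]: i for i in range(len(t)) if t[i]}
    # Steals and tasks are matched up by equal rank; unmatched thieves wait.
    return [(thief_by_rank[k], victim_by_rank[k])
            for k in sorted(thief_by_rank) if k in victim_by_rank]
```

With four cores where cores 1 and 3 are idle and cores 0 and 2 hold tasks of the current round's priority, the sketch pairs core 1 with core 0 and core 3 with core 2. A thief whose rank exceeds the number of available tasks stays unmatched in that phase.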

One additional case that can occur in the task tree computation arises when a non-idle core i 
needs to assign the priority of task T[i] in the task tree, but the task queue of core i is empty, and 
the core has not yet generated its next forked task, if any. In this case core i assigns the priority of 
the node it is currently executing minus 1 as the priority of the task for T[i], and it sets a flag to 
indicate that this value is an upper bound on the priority of a task that has not yet been generated 
on the task queue. If this priority matches the priority of the current round, then the stealing cores 
wait till either a task is generated on core i's task queue or core i becomes idle; otherwise this 
flagged task is ignored since its priority is lower than that for the current phase. 
4.7.1 Analysis

Since each step may entail a non-local read, the time for a step is at least the time for $\Theta(1)$ cache
misses. The only block miss costs are those that occur in accessing the arrays storing S and T. At
the beginning of the computation, each core requests space to store the entries at the leaves and
internal nodes in S and T that it accesses. This is a total of at most four items to a core, and a
different block is assigned to each core by our space allocation property. Hence accessing any item
incurs O(1) block wait cost. Thus, the delay to a stealing core is $O(b \log p)$ within this scheduling
computation.
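
The effect of giving each core its own block can be sketched with a simple address calculation. This is an illustration only: it assumes word-addressed memory, a block size B of at least four words, and a layout in which core i's scheduling entries occupy the start of block i. The helper names `slot_address` and `block_of` are hypothetical.

```python
def slot_address(core, item, B):
    """Word address of scheduling entry `item` (0..3) belonging to `core`.

    Assumed layout: each core's (at most four) entries start at the
    beginning of a block reserved for that core.
    """
    if not 0 <= item < 4 or B < 4:
        raise ValueError("sketch assumes at most four items per core and B >= 4")
    return core * B + item

def block_of(address, B):
    """Index of the block containing a word address."""
    return address // B
```

Because `block_of(slot_address(i, j, B), B)` equals i for every entry j, no two cores write to the same block while updating their entries, which is the property behind the O(1) block wait cost per access.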

Block Wait Time at Execution Stacks. As mentioned above, there could be an additional
delay associated with a steal, and this is the block wait time that could be incurred by the stealing
core while it waits for a core to place a task on a task queue when that queue is empty. Our analysis
on block wait costs in an HBP computation bounds this delay to O(B), thus resulting in the total
delay due to block waits in an HBP computation with critical pathlength $T_\infty$ being $O(pB \cdot T_\infty)$.
Hence the overall cost of all steals is $O(pb(B + \log p) T_\infty)$.

Reducing the Block Wait Cost of Steals. We describe here a method to reduce the overhead
of the cost of steals in HBP computations by using padded BP and HBP computations, which were
defined in Definitions 3.3 and 3.4.

All of our analyses for cache and block misses continue to hold for padded BP and HBP
computations, since each empty array is placed between two different segments on an execution stack,
and this can only reduce the access costs. Additionally, block wait costs are reduced to O(1) at
nodes of height $\Omega(\log B)$ in any BP computation, since when the allocated empty arrays are of
length greater than B, there is no interaction between accesses to segments for different nodes on
the stack. As a result, the total block wait of all steals under PWS in a padded BP computation
of size r reduces to $O(pb \min\{r \log r, B \log B\})$. This cost is absorbed in the block miss cost of the
actual execution of the BP computation as shown in Lemmas 4.8 and 4.9. Similarly, by Lemma 4.2,
the block wait cost of steals under PWS in Type 2 padded HBP computations for all algorithms
we consider in this paper is dominated by the block wait cost incurred by the actual computations.

Thus, if we use padded BP and HBP computations for our algorithms then the overhead for the
scheduling and the steals under our PWS distributed implementation is $O(pb \log p \cdot T_\infty)$. Hence, the
excess in the cost of cache and block misses in this PWS implementation over the cost of the cache
and block misses in the HBP computation is the same as its cache miss excess when considering
only the cost of cache misses in the HBP computation.

We note that if we use padded BP and HBP computations in place of standard BP and HBP
computations, the block miss costs decrease even in the actual computations. However, the reduction
in block misses in the downpass of each BP computation occurs only at higher levels of the BP
tree, where the inherent cache miss cost of the computation is high, as shown in our earlier
analysis. In the up-pass, again as shown earlier, the total block miss cost in a standard
BP computation is dominated by the block miss cost of the downpass, at least for $\alpha = 1/2$, and
a padded BP computation offers no benefit beyond a constant factor. Padded computations may
be useful in other HBP algorithms, for instance those that use $\alpha > 1/2$.

Other simple implementations are possible for PWS, and may be more practical when the input
size is much larger than the number of cores. For instance, by using a 'prisoner type' computation
[16], each scheduling round can complete in O(log p) steps, with only the active cores involved in
the computation (in contrast to the first method where each core would need to devote a constant
fraction of its time to the scheduling process). Here an active core is one that needs to steal or one
for which the task at the head of its task queue has changed. The available tasks will be grouped
in sets, one for each priority. This second method would be more efficient as long as variation in
processor delays is not too large, since each core would need to wait the maximum time at each
level of the tree where its sibling is not active, in order to ensure that no value is being propagated
up the tree by the sibling.

There is one more point to note: if computation by the cores can continue during the execution 
of steal requests, a task available at the start of a phase may no longer be available when it is 
assigned to be stolen. We observe that this does not affect our algorithmic analysis. Let us call 
such a task a pseudo-stolen task. Then, for each priority d, there will be at most p — 1 tasks of 
priority d that are stolen or pseudo-stolen. The remainder of the analysis now proceeds as before. 

5 Discussion 

5.1 Other Mechanisms to Handle Block Interference 

For many of our algorithms, the tall cache requirement that T(B) be in excess of $B^2$ is imposed by
the block misses. Here we consider some strategies that the system hardware or software could
use to mitigate this cost.

1. 2-Core Block Sharing. It would be helpful if the operating system could help prevent ping-ponging
by allowing one core to do all its writes before the other core accesses a pairwise shared
block. The second core would then face a more acceptable delay of O(B) operations and one cache
miss. This might be implemented by a lock with a delayed release, though one would need to be
careful to avoid deadlock. However, this is beyond the scope of the present paper.

2. Many-Core Block Sharing. This occurs when there are tasks with very small memory 
footprint. Then, inevitably, multiple tasks will be writing into the same block. This seems to be 
unavoidable in an oblivious setting, for the algorithm designer cannot set a minimum task size in
terms of the block size.^

In fact, this issue seems to be even more salient in the context of a hierarchically organized
collection of cores (see below), an organization that seems inevitable as the number of cores grows.
For that scenario, a mechanism was proposed in [11] whereby the algorithm provides the run time
scheduler a space bound for each task, and using this, a method for scheduling tasks effectively
on a multi-level memory hierarchy was given (see also the discussion in Section 5.2). A similar
mechanism could be used to avoid stealing unduly small tasks, thereby eliminating the multi-way
block sharing costs for the most part.

5.2 Hierarchy of Caches 

We have so far assumed that the caching environment in the multicore consists of a private cache
for each core, all of which access data from an infinite shared memory. As the number of cores p
gets large it is to be expected that a caching hierarchy of d levels would be present, where each
cache at level i is shared by some number of caches at level i − 1, $d \ge i \ge 2$. In actual practice
d = 2 is a common configuration currently, where each core has a private cache ($L_1$) of size $M_1$ and
all p caches share a shared cache $L_2$ of size $M_2 \ge p \cdot M_1$, and this shared cache accesses data from
the infinite global shared memory.

^Of course, the algorithm designer may create a minimum size to balance the operation overhead of task creation;
these would be costs found even in a uniprocessor execution of that algorithm and even without considering a
cache-oblivious execution.

A simple (but non-optimal) way of utilizing a level-i cache $Q_i$ that is shared by k level-(i − 1)
caches is to consider $Q_i$ as being partitioned into k disjoint segments of equal size, and assigning
each segment to one of the level-(i − 1) caches. If the properties assumed by the sequential cache
oblivious analysis hold at every cache in this hierarchy, then our algorithms can guarantee optimal
use of this partitioned cache hierarchy. This follows from the results for sequential cache-oblivious
algorithms, since each private cache can be viewed as having its own cache hierarchy consisting
of a proportionate portion of each cache it shares at each level of the hierarchy. This observation
has also been noted elsewhere. However, one could hope to do much better by having all cores that share a
given level-i cache compute on different parts of the same subproblem so that the data is utilized
more effectively. Given the competing demand for private versus shared caches, this appears to be a
challenging task in a completely resource-oblivious environment. It appears that a mechanism, such
as that proposed in [11], is needed. The approach there is to have a resource-oblivious specification
of the algorithm along with some basic 'hints' to the run-time scheduler, which allows for optimal
utilization of shared caches at all levels of the cache hierarchy by a scheduler that is aware of the
cache sizes.
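
The equal-size partitioning described above gives each core a proportionate share of every cache on its path to memory. A minimal sketch of that calculation follows; the function name `per_core_shares` and its list-based interface are hypothetical, and it assumes every cache at a given level has the same size and fan-in.

```python
def per_core_shares(cache_sizes, fan_in):
    """Per-core share of each cache level under equal partitioning.

    cache_sizes[i] -- size of one cache at level i + 1
    fan_in[i]      -- number of level-i caches sharing one level-(i + 1) cache
                      (fan_in[0] is 1, since level-1 caches are private)
    """
    shares = []
    cores_below = 1
    for size, k in zip(cache_sizes, fan_in):
        cores_below *= k               # cores served by one cache at this level
        shares.append(size / cores_below)
    return shares
```

For a two-level hierarchy with private caches of 32768 words and a shared cache of 1048576 words serving 8 cores, the shares are 32768 and 131072 words, so each core can be analyzed as if it had its own sequential hierarchy of those sizes.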

5.3 HBP Algorithms in a Bulk-Synchronous Environment 

A bulk-synchronous environment is used in [21 EQ] to develop efficient, parameter-aware multicore 
algorithms. We observe here that balanced HBP algorithms can be mapped efficiently on to these 
models, as described below. 

If the computation is a pure BP computation of size n, then we fork log p levels of recursion to 
obtain p tasks of size n/p, which are mapped onto the p cores. There is a synchronization after 
the p cores complete this part of the computation, followed by log p supersteps to complete the 
computation for the top log p levels. (This can be further refined to obtain a better bound as a 
function of L and g, using the usual BSP methodology, if these parameters are used in the model.) 
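
The mapping of a BP computation just described can be sketched in Python as follows, using a reduction as the example BP computation. This is a sequential simulation only: the function name, the use of a reduction, and the assumption that p is a power of two with n >= p are choices made for this sketch, not details taken from the paper.

```python
from functools import reduce


def bsp_bp_reduce(data, p, combine=lambda a, b: a + b):
    """Simulate a BP reduction mapped bulk-synchronously onto p cores.

    Returns (result, number_of_combining_supersteps).
    Assumes p is a power of two and len(data) >= p.
    """
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    n = len(data)
    assert n >= p, "sketch assumes at least one element per core"

    # Forking log p levels yields p tasks of size about n/p, one per core.
    partial = [
        reduce(combine, data[i * n // p:(i + 1) * n // p])
        for i in range(p)
    ]

    # Synchronization point, then log p supersteps for the top log p levels.
    supersteps = 0
    while len(partial) > 1:
        partial = [
            combine(partial[2 * i], partial[2 * i + 1])
            for i in range(len(partial) // 2)
        ]
        supersteps += 1
    return partial[0], supersteps


result, steps = bsp_bp_reduce(list(range(16)), p=4)
print(result, steps)
```

The sketch does not model the BSP parameters L and g; it only shows the shape of the schedule, namely one local phase followed by log p combining supersteps.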

For an HBP computation, we unravel recursive calls until the first level of recursion where the 
size of the subproblem becomes smaller than the cache size, and at least p parallel recursive tasks 
have been generated, where p is the number of cores if only private caches are considered, and is 
the number of parallel caches at this level of the cache hierarchy, otherwise. We then complete the 
entire computation in terms of subproblems of this size, each of which is computed cache efficiently 
on the cores. The larger BP computations within this sequence of computations are handled as 
described in the above paragraph. On a multi-BSP we perform this process of determining recursion 
level at each level of the cache hierarchy, starting with the top level. On a multicore with a private 
cache, executing bulk-synchronously, this computation can be made cache-oblivious by unfolding 
the recursion to the first level where the total number of subproblems becomes p or greater (rather 
than using the cache size to determine the level of recursion). 
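
The choice of recursion level described above can be sketched as follows. The sketch assumes a recursion in which each call spawns `a` parallel subproblems, each a factor `b` smaller than its parent. These parameters, the function name, and the example values are hypothetical and are introduced only for illustration.

```python
def unfold_depth(n, p, a, b, cache_size=None):
    """First recursion depth with at least p tasks and, if cache_size is
    given, subproblem size smaller than cache_size.

    n -- input size; a -- parallel subproblems per call;
    b -- factor by which the subproblem size shrinks per level.
    With cache_size=None only the task count is used, which corresponds
    to the cache-oblivious variant for private caches.
    Returns (depth, subproblem_size, number_of_tasks).
    """
    depth, size, tasks = 0, n, 1
    while tasks < p or (cache_size is not None and size >= cache_size):
        if size <= 1:
            break  # cannot unfold further
        size = -(-size // b)  # ceiling division
        tasks *= a
        depth += 1
    return depth, size, tasks


# Hypothetical example: 8-way recursion on data of size 2^20 whose
# subproblems shrink by a factor of 4, with 8 cores and caches of size 2^12.
print(unfold_depth(1 << 20, p=8, a=8, b=4, cache_size=1 << 12))
print(unfold_depth(1 << 20, p=8, a=8, b=4))  # cache-oblivious variant
```

On a multi-level hierarchy, the same calculation would be repeated at each level with that level's cache size and number of parallel caches, starting from the top level.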

We note that multicore-oblivious versions of many of the HBP algorithms we have presented 
are given in [11] for multi-level cache hierarchies, and some of these algorithms are quite similar to 
the corresponding algorithms in [20]: all of these algorithms operate along the lines of the method 
we have described above. 
6 Conclusion 



Our results are essentially optimal, subject to the algorithm with which we are starting, except for 
the modest overhead for the steals (the cost for sp). 

We have observed that many well-known algorithms are intrinsically HBP algorithms (and hence 
limited access). The one exception is Depth-n-MM, where we used the modified version given in 
[13] instead of the algorithm in [17] because the latter, being in-place, has n writes to each of the n^2 
output locations, and hence is not limited access. The modified algorithm in [13] has the same work 
and cache complexity, but achieves limited access by using local variables, and BP computations 
for copying. The same approach can be used to obtain limited access versions of I-GEP and LCS 
[8]. In all cases, we retain the work and cache bounds, while losing the in-place property, though 
the additional space used is for local variables, and this space is re-used during the computation. 